
0.1.0 About Introduction to Python¶
Introduction to Python is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style, so it is 100% hands-on! A few hours prior to each lecture, the materials will be available for download on Quercus. The teaching materials will consist of a Jupyter Lab notebook with concepts, comments, instructions, and blank spaces that you will fill in with Python code along with the instructor. Other teaching materials include a live version of the notebook, and datasets to import into Python when required. This learning approach will allow you to spend your time coding, not taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.
0.1.1 Where is this course headed?¶
We'll take a blank-slate approach to Python here and assume that you know essentially nothing about programming. From the beginning of this course to the end, we want to take you from one of these potential scenarios:
You have a pile of data (like an Excel file or tab-separated file) full of experimental observations and you don't know what to do with it.
Maybe you're manipulating large tables entirely in Excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis all over again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about Python and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can:
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!
0.1.2 How do we get there? Step-by-step.¶
In the first two lessons, we will talk about the basic data structures and objects in Python, get cozy with the Jupyter Notebook environment, and learn how to get help when you are stuck. Because everyone gets stuck - a lot! Then you will learn how to get your data in and out of Python, how to tidy your data (data wrangling), and how to subset and merge data. We'll take a break from data wrangling to spend our fourth lecture learning how to generate exploratory data plots. Then we'll harness the power of Python and programming with flow control, before visiting text manipulation techniques in lectures 5 and 6. Data cleaning and string manipulation are really the battleground of coding - getting your data into the format where you can analyse it. Lastly, we will learn to write customized functions to help scale up your analyses.

The structure of the class is a code-along style: It is fully hands on. At the end of each lecture, the complete notes will be made available in a PDF format through the corresponding Quercus module so you don't have to spend your attention on taking notes.
0.1.3 What kind of coding style will we learn?¶
There is no single correct path from A to B - although some paths may be more elegant, or more efficient, than others. With that in mind, the emphasis in this lecture series will be on:
- Code simplicity - learn helpful functions that allow you to focus on understanding the basic tenets of good data wrangling (reformatting) to facilitate quick exploratory data analysis and visualization.
- Code readability - format and comment your code for yourself and others so that even those with minimal experience in Python will be able to quickly grasp the overall steps in your code.
- Code stability - while the core Python language is relatively stable, the behaviour of functions can still change with updates. There are well-developed packages we'll focus on for our analyses. Namely, we'll become more familiar with the pandas series of packages for working with tabular data. This resource is well maintained by a large community of developers. While not always the "fastest" approach, this additional layer can help ensure your code still runs (somewhat) smoothly later down the road.
0.2.0 Lecture objectives¶
Welcome to this fourth lecture in a series of seven. Today we will pick up where we left off last week with our merged data. We'll learn how to explore the data, summarize it, and plot it!
At the end of this lecture we will aim to have covered the following topics:
- Exploratory data analysis with long-format data
- Data plotting with the matplotlib.pyplot package.
- Advanced visualizations with the seaborn package.
- Saving visualizations to file.
0.3.0 A legend for text format in Jupyter markdown¶
grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
... - Within each coding cell this will indicate an area of code that students will need to complete for the code cell to run correctly.
0.4.0 Lecture and data files used in this course¶
0.4.1 Weekly Lecture and skeleton files¶
Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2025-01-IntroPython/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.
Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.
0.4.2 Live-coding HTML page¶
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
0.4.3 Post-lecture PDFs¶
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus.
0.4.4 Microsporidia infection data set description¶
The following datasets used in this week's class come from a published manuscript in PLoS Pathogens entitled "High-throughput phenotyping of infection by diverse microsporidia species reveals a wild C. elegans strain with opposing resistance and susceptibility traits" by Mok et al., 2023. These datasets focus on an analysis of infection in wild isolate strains of the nematode C. elegans by environmental pathogens known as microsporidia. The authors collected embryo counts from individual animals after population-wide infection by microsporidia, and we'll spend our next few classes working with the dataset to learn how to format and manipulate it.
0.4.4.1 Dataset 1: /data/embryo_long_merged.csv¶
This is a result of our efforts (mostly) from last lecture. After transforming a wide-format version of our measurement data, we merged it with some metadata regarding our experiments and now it is ready to be visualized!
0.4.4.2 Dataset 2: /data/infection_signal.tsv¶
This is an imaging analysis of infected C. elegans strains N2 and JU1400, measuring the overall number of pixels for each animal and the number of fluorescent (infected) pixels within the same area.
0.5.0 Packages used in this lesson¶
IPython and InteractiveShell will be accessed just to set the behaviour we want for IPython so we can see multiple code outputs per code cell.
numpy provides a number of mathematical functions as well as the special data class of arrays which we'll be learning about today.
pandas is built upon the numpy package and extends the capabilities to work with tabular data
matplotlib is a package used to work with data in the style of the language MATLAB.
seaborn is built upon a portion of matplotlib to make plotting data more accessible, simplifying it from a layered grammar-of-graphics perspective.
# ----- Always run this at the beginning of class so we can get multi-command output ----- #
# Access options from the iPython core
from IPython.core.interactiveshell import InteractiveShell
# Change the value of ast_node_interactivity
InteractiveShell.ast_node_interactivity = "all"
# ----- Additional packages we want to import for class ----- #
# Import the pandas package
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
# seaborn should already be installed on Jupyter Hub
# !pip install seaborn
1.0.0 An introduction to exploratory data analysis¶
Recall that last week we spent our lecture converting data from a series of worm/pathogen interactions, which contained observations about the formation of spores, meronts, and embryos within individual animals in an infected population. The process involved conversion from wide format to long format, along with merging this data with a set of metadata.
Now that the wrangling is completed, we can perform some exploratory data analysis (EDA). The process of EDA investigates your data to identify abnormalities, summarize its main characteristics and identify potential patterns or trends for further validation. We did some initial statistical summarization on the numerical and non-numerical data but today we'll dig deeper using some additional tools in our Python pockets.
With our EDA today we will try to answer questions like:
- Which worm strain or pathogen strain is most often measured/observed?
- What is the mean number of embryos produced by uninfected animals in our study?
- How variable is that mean across different sets of replicate experiments?
- How many pathogen strains have been tested on each worm strain?
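To make these questions concrete, here's a minimal sketch on a tiny, made-up table (not our real dataset) showing the kinds of summaries we'll be using today:

```python
import pandas as pd

# A toy table in the spirit of our infection data (values are invented)
toy = pd.DataFrame({
    'wormStrain': ['N2', 'N2', 'AB1', 'N2', 'AB1'],
    'numEmbryos': [10, 9, 16, 13, 8],
})

# Central tendency of a numeric column
toy['numEmbryos'].describe()['mean']     # → 11.2

# For a non-numeric column, describe() reports count/unique/top/freq
toy['wormStrain'].describe()['top']      # → 'N2' (the most observed strain)
```

The real dataset asks the same questions, just at a scale where eyeballing the table is no longer an option.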
1.0.1 Import our data from last week¶
Let's open a version of our dataset from last week. We'll find it in the file data/embryo_long_merged.csv. Recall that it is always good practice to explore your data to find out more about aspects such as probability distributions, outliers, and central tendency/dispersion measures.
Now we are going to do some exploratory data analysis (EDA) on embryo_long_merged.csv which we made last class.
# Read in embryo_long_merged.csv
embryo_merged = pd.read_csv('data/embryo_long_merged.csv')
# Take a peek at the data
embryo_merged.head()
# How big is this dataset?
embryo_merged.info()
| worm.number | date | wormStrain | pathogenStrain | pathogenDose | doseLevel | timepoint | merontsPresent | sporesPresent | numEmbryos | ... | Plate Size | Spores/cm2 | Temp | infection.type | Staining Date | Stain type | Slide date | Slide number | Slide Box | Imaging Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 190426 | AB1 | LUAm1 | 0M | Mock | 72hpi | False | False | 10 | ... | 6 | 0.0 | 21 | continuous | 190430 | DY96 | 190501 | 7 | 2 | 190502 |
| 1 | 2 | 190426 | AB1 | LUAm1 | 0M | Mock | 72hpi | False | False | 9 | ... | 6 | 0.0 | 21 | continuous | 190430 | DY96 | 190501 | 7 | 2 | 190502 |
| 2 | 3 | 190426 | AB1 | LUAm1 | 0M | Mock | 72hpi | False | False | 16 | ... | 6 | 0.0 | 21 | continuous | 190430 | DY96 | 190501 | 7 | 2 | 190502 |
| 3 | 4 | 190426 | AB1 | LUAm1 | 0M | Mock | 72hpi | False | False | 13 | ... | 6 | 0.0 | 21 | continuous | 190430 | DY96 | 190501 | 7 | 2 | 190502 |
| 4 | 5 | 190426 | AB1 | LUAm1 | 0M | Mock | 72hpi | False | False | 8 | ... | 6 | 0.0 | 21 | continuous | 190430 | DY96 | 190501 | 7 | 2 | 190502 |
5 rows × 31 columns
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11149 entries, 0 to 11148 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 worm.number 11149 non-null int64 1 date 11149 non-null int64 2 wormStrain 11149 non-null object 3 pathogenStrain 11149 non-null object 4 pathogenDose 11149 non-null object 5 doseLevel 11149 non-null object 6 timepoint 11149 non-null object 7 merontsPresent 11149 non-null bool 8 sporesPresent 11149 non-null bool 9 numEmbryos 11149 non-null int64 10 experiment 11149 non-null object 11 experimenter 11149 non-null object 12 description 11149 non-null object 13 Infection Date 11149 non-null int64 14 Plate Number 11149 non-null int64 15 Total Worms 11149 non-null int64 16 Spore Lot 11149 non-null object 17 Lot concentration 11149 non-null int64 18 Total ul spore 11149 non-null float64 19 Infection Round 11149 non-null int64 20 40X OP50 (mL) 11149 non-null float64 21 Plate Size 11149 non-null int64 22 Spores/cm2 11149 non-null float64 23 Temp 11149 non-null int64 24 infection.type 11149 non-null object 25 Staining Date 11149 non-null int64 26 Stain type 11149 non-null object 27 Slide date 11149 non-null int64 28 Slide number 11149 non-null int64 29 Slide Box 11149 non-null int64 30 Imaging Date 11149 non-null int64 dtypes: bool(2), float64(3), int64(15), object(11) memory usage: 2.5+ MB
So as a reminder, we've imported a dataset that is 11,149 rows with 31 columns of data. At this point, the only measurement data exists in 3 columns: merontsPresent, sporesPresent, and numEmbryos.
Before we go any further, however, let's drop some of the extraneous metadata that we won't need for our analyses. Pretty much everything from Infection Round onwards is unnecessary. From our call to .info() we can see that we just need the first 19 columns.
# Subset the data from our dataset for the first 19 columns
embryo_merged_subset = embryo_merged.iloc[:, :19]
# What does the subset look like?
embryo_merged_subset.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11149 entries, 0 to 11148 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 worm.number 11149 non-null int64 1 date 11149 non-null int64 2 wormStrain 11149 non-null object 3 pathogenStrain 11149 non-null object 4 pathogenDose 11149 non-null object 5 doseLevel 11149 non-null object 6 timepoint 11149 non-null object 7 merontsPresent 11149 non-null bool 8 sporesPresent 11149 non-null bool 9 numEmbryos 11149 non-null int64 10 experiment 11149 non-null object 11 experimenter 11149 non-null object 12 description 11149 non-null object 13 Infection Date 11149 non-null int64 14 Plate Number 11149 non-null int64 15 Total Worms 11149 non-null int64 16 Spore Lot 11149 non-null object 17 Lot concentration 11149 non-null int64 18 Total ul spore 11149 non-null float64 dtypes: bool(2), float64(1), int64(7), object(9) memory usage: 1.5+ MB
1.1.0 Which worm and pathogen strains are most often used in our experiments?¶
Let's begin with a deceptively simple question about our data. As we'll see, however, it requires more than a single method call on our data to discern the answer. To get there, we'll walk through the thought process so you can avoid potential pitfalls later in your own analyses.
1.1.1 Get a summary of our information¶
There is a lot of subgrouped data hidden within our dataset. Experiments are classified by their date, wormStrain, pathogenStrain, pathogenDose, and timepoint. The combination of all five of these can also be found in the experiment column, although that combination may not always be useful to us.
Let's begin with the .describe() method to review any numerical data that we can.
# Get a description of the numeric data
embryo_merged_subset.describe()
| worm.number | date | numEmbryos | Infection Date | Plate Number | Total Worms | Lot concentration | Total ul spore | |
|---|---|---|---|---|---|---|---|---|
| count | 11149.000000 | 11149.00000 | 11149.000000 | 11149.000000 | 11149.000000 | 11149.000000 | 11149.00000 | 11149.000000 |
| mean | 27.499148 | 197408.14378 | 9.148444 | 197405.090143 | 18.785721 | 1309.444793 | 273701.61001 | 25.916820 |
| std | 16.460217 | 4864.97377 | 7.509851 | 4864.935158 | 15.939817 | 1678.485600 | 124153.58178 | 33.928541 |
| min | 1.000000 | 190426.00000 | 0.000000 | 190423.000000 | 1.000000 | 500.000000 | 63625.00000 | 0.000000 |
| 25% | 14.000000 | 190426.00000 | 2.000000 | 190423.000000 | 7.000000 | 1000.000000 | 176000.00000 | 0.000000 |
| 50% | 27.000000 | 200714.00000 | 9.000000 | 200711.000000 | 14.000000 | 1000.000000 | 176000.00000 | 8.196721 |
| 75% | 40.000000 | 200825.00000 | 15.000000 | 200822.000000 | 25.000000 | 1000.000000 | 427000.00000 | 56.818182 |
| max | 115.000000 | 200918.00000 | 48.000000 | 200915.000000 | 63.000000 | 10000.000000 | 427000.00000 | 113.636364 |
1.1.2 Use the .describe() method to summarize non-numeric columns¶
Recall that we can also summarize our non-numeric data, to a certain extent, as long as we provide it properly to the .describe() method. This method can identify the "top" occurring entry in a column as well as its frequency. We already know that the wormStrain column contains the strain information for each individual worm measured in our data. The same goes for pathogenStrain, which contains similar information about the pathogens used. It should be simple enough to just create a summary of that information. Let's give it a try.
# Use the describe method on the wormStrain and pathogenStrain columns from our merged subset
embryo_merged_subset.loc[:,['wormStrain', 'pathogenStrain']].describe()
| wormStrain | pathogenStrain | |
|---|---|---|
| count | 11149 | 11149 |
| unique | 18 | 10 |
| top | N2 | LUAm1 |
| freq | 2941 | 5030 |
At a quick glance, we can answer our first question: N2 is the most-measured worm strain in our infection studies, while LUAm1 is the most-measured pathogen. There are a few catches to these results, however:
Although N2 is the most-often measured individual animal, this is biased by the fact that it acts as a control strain in many experiments. If we were to look at specific groups of experiments and how often N2 was included, would it still be the most prevalent?
Our measurements might be comprised of entries where LUAm1 is the pathogenStrain, but its pathogenDose is 0! So it's not really being used to infect at all.
Let's address the second point and circle back around to the first in a little bit, as it is slightly more complex.
1.1.3 Filter your data with conditional booleans¶
Last lecture we saw a few examples where we could subset our data using a simple conditional statement with a method like .isna(). While we referred to this as "slicing" our data, you can also think of it as a way to filter data. We can of course, filter our data using other conditional statements and before summarizing your data, you should consider the nature of your values! In our dataset it appears that the pathogenDose can denote whether or not an animal was exposed to a pathogen or was "mock-infected". A mock-infected sample would have a pathogenDose value of 0.
To solve our issues with summarizing the pathogen usage conundrum we turn to filtering our data by the pathogenDose column. We'll do this in two steps:
- Filter the dataset by a conditional.
- Pass that dataset on for summarization.
# Recall that we can broadcast a conditional query to multiple values in a DataFrame
embryo_merged_subset.loc[:, 'pathogenDose'] > 0
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[6], line 2 1 # Recall that we can broadcast a conditional query to multiple values in a DataFrame ----> 2 embryo_merged_subset.loc[:, 'pathogenDose'] > 0 File ~\anaconda3\envs\CSBJupyter\Lib\site-packages\pandas\core\ops\common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other) 72 return NotImplemented 74 other = item_from_zerodim(other) ---> 76 return method(self, other) File ~\anaconda3\envs\CSBJupyter\Lib\site-packages\pandas\core\arraylike.py:56, in OpsMixin.__gt__(self, other) 54 @unpack_zerodim_and_defer("__gt__") 55 def __gt__(self, other): ---> 56 return self._cmp_method(other, operator.gt) File ~\anaconda3\envs\CSBJupyter\Lib\site-packages\pandas\core\series.py:5803, in Series._cmp_method(self, other, op) 5800 lvalues = self._values 5801 rvalues = extract_array(other, extract_numpy=True, extract_range=True) -> 5803 res_values = ops.comparison_op(lvalues, rvalues, op) 5805 return self._construct_result(res_values, name=res_name) File ~\anaconda3\envs\CSBJupyter\Lib\site-packages\pandas\core\ops\array_ops.py:346, in comparison_op(left, right, op) 343 return invalid_comparison(lvalues, rvalues, op) 345 elif lvalues.dtype == object or isinstance(rvalues, str): --> 346 res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues) 348 else: 349 res_values = _na_arithmetic_op(lvalues, rvalues, op, is_cmp=True) File ~\anaconda3\envs\CSBJupyter\Lib\site-packages\pandas\core\ops\array_ops.py:131, in comp_method_OBJECT_ARRAY(op, x, y) 129 result = libops.vec_compare(x.ravel(), y.ravel(), op) 130 else: --> 131 result = libops.scalar_compare(x.ravel(), y, op) 132 return result.reshape(x.shape) File ops.pyx:107, in pandas._libs.ops.scalar_compare() TypeError: '>' not supported between instances of 'str' and 'int'
Oops! The pathogenDose column is a string! We need to fix that up first and convert it from values like "0M" to a float value like 0.0.
Recall we have at our disposal the following methods:
- .pop(): updates the DataFrame object and returns a Series object.
- Series.str.split(): returns a DataFrame object if expand = True.
- .astype(): returns a DataFrame object.
- .insert(): returns nothing BUT updates a DataFrame object. Note that this method's parameters include:
  - loc: the index position to insert at (causing it to become the variable at that position)
  - column: the name of your new column
  - value: the values that comprise the column you are inserting
  - allow_duplicates: a boolean (False by default) which will allow you to insert a column with a label that already exists. Note that doing so can be problematic as it does not replace the pre-existing column.
We'll use those now, to fix the pathogenDose column and replace it in our embryo_merged_subset.
# Use the insert method
embryo_merged_subset.insert(loc = 4, # Location to insert at
column = 'pathogenDose', # The "new" column name to insert
# To calculate the value, we'll pop from the subset (which removes the chosen column!)
value = (embryo_merged_subset.pop('pathogenDose')
# Break up the float from the "M" which will make 2 columns
.str.split(pat = "M", expand = True)
# Convert the first column to a float, and then provide only that to insert
                            .astype({0:'float64'})[0]) # The [0] index slices our DataFrame, returning just the first column as a Series
)
# Check on the updated subset data
embryo_merged_subset.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 11149 entries, 0 to 11148 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 worm.number 11149 non-null int64 1 date 11149 non-null int64 2 wormStrain 11149 non-null object 3 pathogenStrain 11149 non-null object 4 pathogenDose 11149 non-null float64 5 doseLevel 11149 non-null object 6 timepoint 11149 non-null object 7 merontsPresent 11149 non-null bool 8 sporesPresent 11149 non-null bool 9 numEmbryos 11149 non-null int64 10 experiment 11149 non-null object 11 experimenter 11149 non-null object 12 description 11149 non-null object 13 Infection Date 11149 non-null int64 14 Plate Number 11149 non-null int64 15 Total Worms 11149 non-null int64 16 Spore Lot 11149 non-null object 17 Lot concentration 11149 non-null int64 18 Total ul spore 11149 non-null float64 dtypes: bool(2), float64(2), int64(7), object(8) memory usage: 1.5+ MB
Now check on that conditional filtering again!
# Now we can use a conditional comparison on the pathogenDose column
embryo_merged_subset.loc[:, 'pathogenDose'] > 0
0 False
1 False
2 False
3 False
4 False
...
11144 True
11145 True
11146 True
11147 True
11148 True
Name: pathogenDose, Length: 11149, dtype: bool
Now we need to summarize the data.
# Supply the result of the conditional query as a filtering criteria
print("Filtered pathogenStrain summary")
embryo_merged_subset.loc[embryo_merged_subset.loc[:, 'pathogenDose'] > 0, # remember we're filtering by rows!
# We only need the pathogenStrain column to summarize
'pathogenStrain' # No list notation means we'll get a Series back
].describe()
# Note the \n adds a blank line to our output
print("\nUnfiltered pathogenStrain summary")
# Compare that to the unfiltered subset
embryo_merged_subset.loc[:, 'pathogenStrain'].describe()
Filtered pathogenStrain summary
count 7730 unique 10 top LUAm1 freq 2838 Name: pathogenStrain, dtype: object
Unfiltered pathogenStrain summary
count 11149 unique 10 top LUAm1 freq 5030 Name: pathogenStrain, dtype: object
1.1.4 The .describe() method returns a Series object¶
A useful aspect of the .describe() method is that it returns a Series object, which means its information can be retrieved or saved for further use! Recall we can choose from the possible index names. In the case of non-numeric summaries, we can use count, unique, top, or freq. These are also 0-indexed, so they can be retrieved by position as well!
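As a quick illustration on a throwaway Series (not our real data), each of these retrieval styles gets at the same entry:

```python
import pandas as pd

# A throwaway Series just to illustrate retrieval from describe()
s = pd.Series(['a', 'b', 'a', 'a'])
summary = s.describe()   # index: count, unique, top, freq

summary['top']     # retrieve by index label     → 'a'
summary.iloc[2]    # retrieve by position (0-indexed, 'top' is position 2) → 'a'
summary.top        # attribute-style access also works for valid names → 'a'
```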
Therefore we can use the results of .describe() to further filter our data along different variables. For instance, we can now ask the question "Among the most observed pathogen used, which worm strain (host) is most observed?" Let's answer that question now!
# Filter for experiments with actual pathogen exposure
pathogen_Summary = embryo_merged_subset.loc[embryo_merged_subset.loc[:, 'pathogenDose'] > 0,
# We only need the pathogenStrain column to summarize
'pathogenStrain'].describe()
# Of the LUAm1 infections, which worm strain reigns supreme?
embryo_merged_subset.loc[(embryo_merged_subset["pathogenStrain"] == pathogen_Summary.top), # Filter rows by the "top" pathogenStrain
'wormStrain'].describe() # Keep the wormStrain column to summarize it
count 5030 unique 15 top N2 freq 864 Name: wormStrain, dtype: object
1.1.5 Know the difference between filtering and the .filter() method¶
You may be wondering to yourself: surely there must be a .filter() method implemented for the DataFrame. It seems like such an essential part of working with DataFrames. You're right that such a method exists, but you would be wrong to think that it is used for conditional filtering.
The .filter() method is used merely for subsetting your DataFrame by selecting on column names or by a regular expression pattern (see Lecture 06). While this can be helpful in certain contexts, it does NOT implement the idea of filtering rows of our data based on conditional criteria.
Here's a quick example of how to use it.
# Select the wormStrain, pathogenStrain, and numEmbryos columns
embryo_merged_subset.filter(items = ['wormStrain', 'pathogenStrain', 'numEmbryos'])
| wormStrain | pathogenStrain | numEmbryos | |
|---|---|---|---|
| 0 | AB1 | LUAm1 | 10 |
| 1 | AB1 | LUAm1 | 9 |
| 2 | AB1 | LUAm1 | 16 |
| 3 | AB1 | LUAm1 | 13 |
| 4 | AB1 | LUAm1 | 8 |
| ... | ... | ... | ... |
| 11144 | N2 | ERTm5-96H | 1 |
| 11145 | N2 | ERTm5-96H | 0 |
| 11146 | N2 | ERTm5-96H | 0 |
| 11147 | N2 | ERTm5-96H | 3 |
| 11148 | N2 | ERTm5-96H | 2 |
11149 rows × 3 columns
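Besides items, .filter() also accepts a like substring or a regex pattern for selecting columns by name. Here's a small sketch on a one-row stand-in frame (made-up values) with the same column names:

```python
import pandas as pd

# A stand-in frame; the real data shares these column names
df = pd.DataFrame({
    'wormStrain': ['N2'], 'pathogenStrain': ['LUAm1'], 'numEmbryos': [10],
})

# like= keeps any column whose name contains the substring
df.filter(like='Strain').columns.tolist()    # → ['wormStrain', 'pathogenStrain']

# regex= keeps any column whose name matches the pattern
df.filter(regex='Strain$').columns.tolist()  # → ['wormStrain', 'pathogenStrain']
```

Note that in every case .filter() is choosing columns, never rows.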
1.1.6 Use the .query() method to conditionally filter your data¶
Thus far we have seen the use of boolean logic to help produce conditional filtering by selecting on the resulting arrays with the .loc[] or .iloc[] methods. However, you might have noticed that the code is rather clunky:
embryo_merged_subset.loc[embryo_merged_subset.loc[:, 'pathogenDose'] > 0, 'pathogenStrain']
As an alternative, we can use the .query() method which has its roots in database querying language. This method has two main parameters:
- expr: a string object which encapsulates the filter query you want to use. This is defined within a set of single quotes, i.e. 'pathogenDose > 0'.
  - When referencing column names with spaces, surround them with back-tick (grave accent) characters (`).
  - When filtering a column using a string value, surround the value with double quotes (").
  - When referring to a variable in your environment, prefix it with the @ symbol.
- inplace: a boolean on whether to replace the DataFrame and produce no output (True) or pass on the new DataFrame (False, default).
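Here's a quick sketch of those three rules together on a toy frame with made-up values (the threshold variable is hypothetical, standing in for any variable in your environment):

```python
import pandas as pd

# Toy frame: note the space in the 'Total Worms' column name
df = pd.DataFrame({
    'pathogenDose': [0.0, 1.0, 3.0],
    'wormStrain':   ['N2', 'N2', 'AB1'],
    'Total Worms':  [1000, 1000, 500],
})

threshold = 0.5  # an environment variable we reference with @

# Back-ticks for the spaced column, double quotes for the string value,
# and @ to pull `threshold` in from the environment
df.query('pathogenDose > @threshold and wormStrain == "N2" and `Total Worms` >= 1000')
# Only the middle row survives all three conditions
```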
Let's repeat our previous filtering from section 1.1.3 using .query() and method chaining.
# Filter to see which pathogen strain is most often measured.
# We'll encapsulate the call in () so we can separate across lines
# Filter the dataset for observations with actual infection
(embryo_merged_subset.query('pathogenDose > 0')
# Select just pathogenStrain as a column
.filter(['pathogenStrain'])
# Get a summary
.describe()
)
| pathogenStrain | |
|---|---|
| count | 7730 |
| unique | 10 |
| top | LUAm1 |
| freq | 2838 |
Our query from section 1.1.4 asked "Among the most observed pathogen used, which worm strain (host) is most observed?". We created an intermediate variable pathogen_Summary to help answer that question and we can still use the .query() method to simplify the process.
# To complete our query, we use the @ to identify our intermediate variable
(embryo_merged_subset.query('pathogenStrain == @pathogen_Summary.top')
# Filter on the column we want to use
.filter(["wormStrain"])
# Summarize
.describe()
)
| wormStrain | |
|---|---|
| count | 5030 |
| unique | 15 |
| top | N2 |
| freq | 864 |
1.2.0 Which worm strain is most often included in our different experimental groups?¶
Circling back to our first question, we discovered that N2 is the worm strain that is most often measured across all of our datasets, but the dataset contains groups of different infection experiments. Is N2 the most prevalent strain when we look at its inclusion within individual experiments? How do we explore this?
1.2.1 Grouping your DataFrame using the .groupby() method¶
Thinking about our problem, we already have an identifier that breaks down each experimental grouping - date. Each different date essentially sets up a different experimental replicate. We could use our newly-taught filtering techniques, but we would have to cycle through each potential value and summarize each subset. (Honestly, this would have been my approach when first learning!)
Luckily for you, the .groupby() method can sort all that data for you based on the criteria you provide. The important parameters to us today are:
- by: a function, label, or list of labels that you want to use to determine grouping criteria. You can even provide a dictionary object where specific key:value pairs determine groupings.
- axis: how to split - along the rows (0, default) or columns (1).
- as_index: a boolean to determine if the index labels should be based on group labels (True by default).
# How many individual dates/replicates are there?
len(embryo_merged_subset.loc[:,'date'].unique())
11
# Group our subset data by the 'date' column
embryo_merged_subset.groupby(by = ['date'])
# What is returned to us?
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001AE860C82D0>
1.2.2 Use the .head() method to view rows from each group¶
As you can see above, we created a DataFrameGroupBy object, but if we attempted to look at it, it would look pretty much like the original embryo_merged_subset. The major difference is that the data has now been essentially sorted by the date column. In order to view part of it, we can use the .head(n) method, which will return n rows from each group.
# Group our subset data by the 'date' column and view 1 row from each group
embryo_merged_subset.groupby(by = ['date']).head(1)
| worm.number | date | wormStrain | pathogenStrain | pathogenDose | doseLevel | timepoint | merontsPresent | sporesPresent | numEmbryos | experiment | experimenter | description | Infection Date | Plate Number | Total Worms | Spore Lot | Lot concentration | Total ul spore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 190426 | AB1 | LUAm1 | 0.0 | Mock | 72hpi | False | False | 10 | 190426_AB1_LUAm1_0M_72hpi | CM | Wild isolate phenoMIP retest | 190423 | 7 | 1000 | 2A | 176000 | 0.0 |
| 594 | 1 | 200707 | ED3052A | LUAm1 | 0.0 | Mock | 72hpi | False | False | 12 | 200707_ED3052A_LUAm1_0M_72hpi | CM | Lua1 continuous Infection test JU1400 and ED30... | 200704 | 5 | 1000 | 2A | 176000 | 0.0 |
| 842 | 1 | 200714 | ED3052A | LUAm1 | 0.0 | Mock | 72hpi | False | False | 8 | 200714_ED3052A_LUAm1_0M_72hpi | CM | Lua1 continuous Infection test JU1400 and ED3052 | 200711 | 5 | 1000 | 2A | 176000 | 0.0 |
| 1092 | 1 | 200721 | ED3052A | LUAm1 | 0.0 | Mock | 72hpi | False | False | 12 | 200721_ED3052A_LUAm1_0M_72hpi | CM | Lua1 continuous Infection test JU1400 and ED3052 | 200718 | 5 | 1000 | 2A | 176000 | 0.0 |
| 1342 | 1 | 200821 | AWR144 | LUAm1 | 0.0 | Mock | 72hpi | False | False | 23 | 200821_AWR144_LUAm1_0M_72hpi | CM | Lua1 continuous Infection test 6X NIL, VC40171... | 200818 | 7 | 1000 | 2A | 176000 | 0.0 |
| 1692 | 1 | 200825 | AWR144 | LUAm1 | 0.0 | Mock | 72hpi | False | False | 14 | 200825_AWR144_LUAm1_0M_72hpi | CM | Lua1 continuous Infection test 6X NIL, VC40171... | 200822 | 7 | 1000 | 2A | 176000 | 0.0 |
| 1942 | 1 | 200904 | AWR144 | LUAm1 | 0.0 | Mock | 72hpi | False | False | 16 | 200904_AWR144_LUAm1_0M_72hpi | CM | NIL tests for Lua1 and ERTM5, low dose ERTM5 t... | 200901 | 7 | 1000 | 2A | 176000 | 0.0 |
| 5158 | 1 | 200915 | AWR144 | ERTm5 | 0.0 | Mock | 72hpi | False | False | 19 | 200915_AWR144_ERTm5_0M_72hpi | CM | NIL tests for ERTM5 | 200912 | 5 | 1000 | 2 | 427000 | 0.0 |
| 5358 | 1 | 200918 | AWR144 | ERTm5 | 0.0 | Mock | 72hpi | False | False | 26 | 200918_AWR144_ERTm5_0M_72hpi | CM | NIL tests for ERTM5 | 200915 | 5 | 1000 | 2 | 427000 | 0.0 |
| 10551 | 1 | 200905 | JU1400 | ERTm5-96H | 0.0 | Mock | 96hpi | False | False | 9 | 200905_JU1400_ERTm5-96H_0M_96hpi | CM | NIL tests for Lua1 and ERTM5, low dose ERTM5 t... | 200901 | 26 | 1000 | 2 | 427000 | 0.0 |
| 10649 | 1 | 200916 | JU1400 | ERTm5-96H | 0.0 | Mock | 96hpi | False | False | 15 | 200916_JU1400_ERTm5-96H_0M_96hpi | CM | NIL tests for ERTM5 | 200912 | 12 | 500 | 2 | 427000 | 0.0 |
Using the .groupby() method and .head(), we can view a representative observation from each group. To answer our above question, however, we need more information. If we repeat our above code but also include wormStrain and pathogenStrain columns in our grouping, what will that produce?
# Group our subset data by the 'date', 'wormStrain', and 'pathogenStrain' columns and grab the first row from each
# How big is the result?
embryo_merged_subset.groupby(by = ['date', 'wormStrain', 'pathogenStrain']).head(1).shape
(99, 19)
1.2.3 Subset and apply functions to your grouped DataFrame¶
From our results, we see there are 99 separate combinations of date : wormStrain : pathogenStrain in our grouped DataFrame! Now that we have our groups we can begin to apply functions to summarize data from each group.
Back to our question: which wormStrain produces the largest number of date:wormStrain:pathogenStrain combinations?
We've already used some functions like unique(), but other helpful functions to apply include sum(), max(), min(), and median(). Functions like idxmin() and idxmax() will return the index of the min and max values respectively, but only the first occurrence of the value being sought.
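That first-occurrence behaviour is worth seeing once. A tiny sketch with a made-up Series containing a tied maximum:

```python
import pandas as pd

# A small Series with a tied maximum at labels 'b' and 'c'
s = pd.Series([3, 9, 9, 1], index=['a', 'b', 'c', 'd'])

# idxmax() returns the label of the FIRST maximum ('b'),
# even though 'c' holds the same value
first_max_label = s.idxmax()
first_min_label = s.idxmin()
```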
In this case what we really need is something that can help produce a frequency table. The .value_counts() method can be applied to a DataFrame to produce such a table. We just need to provide the column(s) we'd like to count with the subset parameter.
In this case, we are interested in wormStrain so let's see what that looks like.
# Convert the "wormStrain" column from our grouped data to a frequency table
(embryo_merged_subset.groupby(by = ['date', 'wormStrain', 'pathogenStrain'])
# head(1) gets us a representative row from each group
.head(1)
# Create the frequency table
.value_counts(subset = ['wormStrain'])
)
wormStrain N2 29 JU1400 28 MY1 7 AWR145 6 AWR144 6 VC40171 3 VC20019 3 ED3052A 3 ED3052B 3 MY2 2 JU360 2 JU642 1 JU397 1 JU300 1 MY6 1 ED3042 1 CB4856 1 AB1 1 Name: count, dtype: int64
So, from our groupings it looks like N2 edges out JU1400 by just a single group. That's pretty close but now we've answered our question!
1.3.0 What is the mean number of embryos produced by uninfected animals in our study?¶
Let's approach this question by identifying our criteria:
- Filtering for uninfected animals
- Measuring mean embryos across all data
# Identify the mean embryo value of uninfected strains
# Subset by uninfected animals
(embryo_merged_subset.query('pathogenDose == 0')
# Group by wormStrain and take the numEmbryos columns
.groupby(by = ['wormStrain'])['numEmbryos']
# Calculate the mean
.mean()
)
wormStrain AB1 10.903226 AWR144 19.228000 AWR145 20.656000 CB4856 20.750000 ED3042 13.147541 ED3052A 10.526667 ED3052B 12.047297 JU1400 11.139501 JU300 15.622951 JU360 16.788618 JU397 11.457143 JU642 14.733333 MY1 11.725581 MY2 21.886179 MY6 18.360656 N2 18.912500 VC20019 15.143678 VC40171 4.586667 Name: numEmbryos, dtype: float64
1.3.1 Use sort_values() to organize your data¶
It looks like we get the results we're looking for, but the data is sorted in alphabetical index order. What if, however, we're interested in finding the highest and lowest values in our dataset? Since it's small, a quick and easy way to ascertain this information is with the sort_values() method, which sorts in ascending order by default.
If we had a multi-column DataFrame, we could use the by parameter to select multiple columns to sort with.
To sort in descending or reverse order, we can set the ascending = False parameter.
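The multi-column case with the by parameter might look like this sketch (a hypothetical mini-table, not our real data):

```python
import pandas as pd

# Hypothetical mini-table of per-replicate embryo means
toy = pd.DataFrame({'strain': ['N2', 'AB1', 'N2'],
                    'embryos': [18, 10, 24]})

# Sort by strain (ascending), then by embryos (descending) within each strain
ranked = toy.sort_values(by=['strain', 'embryos'],
                         ascending=[True, False])
```

Note that ascending can be a list, letting each sort column have its own direction.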
# Identify the mean embryo value of uninfected strains AND sort them!
# Subset by uninfected animals
(embryo_merged_subset.query('pathogenDose == 0')
# Group by wormStrain and take the numEmbryos columns
.groupby(by = ['wormStrain'])['numEmbryos']
# Calculate the mean
.mean()
# Sort the data by descending order
.sort_values(ascending = False)
)
wormStrain MY2 21.886179 CB4856 20.750000 AWR145 20.656000 AWR144 19.228000 N2 18.912500 MY6 18.360656 JU360 16.788618 JU300 15.622951 VC20019 15.143678 JU642 14.733333 ED3042 13.147541 ED3052B 12.047297 MY1 11.725581 JU397 11.457143 JU1400 11.139501 AB1 10.903226 ED3052A 10.526667 VC40171 4.586667 Name: numEmbryos, dtype: float64
We now have an answer to our question and have identified the mean number of embryos per uninfected animal in each strain. This gets us a baseline value for each strain that can be used in later comparisons!
# Comprehension code answer 1.3.1
# Identify the mean embryo value of uninfected strains AND sort them!
# Subset by uninfected animals
(embryo_merged_subset.query('pathogenDose == 0')
# Group by wormStrain and take the numEmbryos columns
.groupby(by = ['wormStrain'])['numEmbryos']
# Calculate the mean
.mean()
# How do we calculate the median?
...
)
Comprehension Question 1.3.1 Answer:¶
1.4.0 What is the dispersion of the uninfected mean across different sets of replicate experiments?¶
Now that we've got a few very helpful tools under our belt, we can take our query to the next level and ask what the mean and standard deviation of any individual worm strain is across multiple replicates.
1.4.1 Plan your analysis strategy to avoid mistaken assumptions¶
Time to think about your dataset in relationship to your question. We already know that
- each worm strain may appear within any infection experiment
- a dose of 0 represents uninfected animals.
Again, it will be important to filter our data before summarizing it. Then you must understand which groupings you are looking for and what measurements you'd like to summarize. Let's break down the problem:
- Filter for uninfected animals.
- Group data by date, wormStrain, and pathogenStrain. This will subgroup data into infection replicates!
- Calculate the mean number of embryos.
- Summarize the data grouped by wormStrain.
# Determine the standard deviation of the mean embryo counts across each strain
# IN the uninfected state - ie a baseline embryo count.
# Query for only uninfected data
(embryo_merged_subset.query('pathogenDose == 0')
# Group by infection experiment
.groupby(by = ['date', 'wormStrain', 'pathogenStrain'])
# Isolate numEmbryos and generate a mean for each group
# Indexing like this returns a SeriesGroupBy object
['numEmbryos']
.mean()
# Group the series of means again by wormStrain
.groupby(['wormStrain'])
# Summarize each group of numEmbryo means
.describe()
# Sort the data by descending order
.sort_values(by = 'mean', ascending = False)
)
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| wormStrain | ||||||||
| MY2 | 2.0 | 21.892460 | 0.364216 | 21.634921 | 21.763690 | 21.892460 | 22.021230 | 22.150000 |
| CB4856 | 1.0 | 20.750000 | NaN | 20.750000 | 20.750000 | 20.750000 | 20.750000 | 20.750000 |
| AWR145 | 5.0 | 20.656000 | 2.160435 | 17.180000 | 20.520000 | 20.860000 | 21.760000 | 22.960000 |
| AWR144 | 5.0 | 19.228000 | 1.990407 | 16.940000 | 17.400000 | 20.060000 | 20.100000 | 21.640000 |
| N2 | 13.0 | 18.795973 | 2.816125 | 15.240000 | 16.300000 | 19.358209 | 20.880000 | 24.541667 |
| MY6 | 1.0 | 18.360656 | NaN | 18.360656 | 18.360656 | 18.360656 | 18.360656 | 18.360656 |
| JU360 | 2.0 | 16.799841 | 1.952303 | 15.419355 | 16.109598 | 16.799841 | 17.490085 | 18.180328 |
| JU300 | 1.0 | 15.622951 | NaN | 15.622951 | 15.622951 | 15.622951 | 15.622951 | 15.622951 |
| VC20019 | 3.0 | 15.181159 | 0.616995 | 14.468750 | 15.000000 | 15.531250 | 15.537364 | 15.543478 |
| JU642 | 1.0 | 14.733333 | NaN | 14.733333 | 14.733333 | 14.733333 | 14.733333 | 14.733333 |
| ED3042 | 1.0 | 13.147541 | NaN | 13.147541 | 13.147541 | 13.147541 | 13.147541 | 13.147541 |
| ED3052B | 3.0 | 12.020833 | 2.987960 | 10.062500 | 10.301250 | 10.540000 | 13.000000 | 15.460000 |
| MY1 | 4.0 | 11.611538 | 1.578067 | 9.520000 | 10.960000 | 11.840000 | 12.491538 | 13.246154 |
| JU397 | 1.0 | 11.457143 | NaN | 11.457143 | 11.457143 | 11.457143 | 11.457143 | 11.457143 |
| JU1400 | 12.0 | 11.217214 | 3.025072 | 5.724638 | 9.470000 | 11.720000 | 13.585000 | 14.660000 |
| AB1 | 1.0 | 10.903226 | NaN | 10.903226 | 10.903226 | 10.903226 | 10.903226 | 10.903226 |
| ED3052A | 3.0 | 10.526667 | 0.761665 | 9.700000 | 10.190000 | 10.680000 | 10.940000 | 11.200000 |
| VC40171 | 3.0 | 4.586667 | 1.571284 | 2.980000 | 3.820000 | 4.660000 | 5.390000 | 6.120000 |
Notice the presence of NaN values in our standard deviation (std) column? Can you tell why this is the case? What is the relationship between all of the strains with such a value?
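Hint: pandas computes the sample standard deviation (ddof=1), which is undefined for a group containing a single observation. A minimal demonstration:

```python
import pandas as pd

# One observation: the sample standard deviation (ddof=1) is undefined
lone_group = pd.Series([20.75])
sd_single = lone_group.std()

# Two or more observations: std() returns a number
sd_pair = pd.Series([21.6, 22.2]).std()
```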
1.5.0 How many pathogens have been tested on each strain?¶
Circling back again towards our initial questions, one in a similar vein would be to see how many pathogens have been tested on each strain. Let's remember to plan our analysis:
- We only want to examine data where the pathogenDose is > 0.
- We want to examine data based on each individual worm strain.
- We are looking to identify the list of unique pathogen strains tested on each grouping
- We want to determine the size of each of those lists.
Let's start with the first 3 criteria:
# Query for only infected data
(embryo_merged_subset.query('pathogenDose > 0')
# Group our data by worm strain
.groupby('wormStrain')
# Isolate pathogen strain information in each group
['pathogenStrain']
# Determine the unique values in each
.unique()
)
wormStrain AB1 [LUAm1] AWR144 [LUAm1, ERTm5] AWR145 [LUAm1, ERTm5] CB4856 [ERTm5] ED3042 [LUAm1] ED3052A [LUAm1] ED3052B [LUAm1] JU1400 [LUAm1, ERTm5, AWRm78, LUAm3, MAM1, LUAm1-HK, ... JU300 [ERTm5] JU360 [LUAm1, ERTm2] JU397 [LUAm1] JU642 [LUAm1] MY1 [LUAm1, ERTm5] MY2 [ERTm5, ERTm2] MY6 [LUAm1] N2 [LUAm1, ERTm5, ERTm2, AWRm78, LUAm3, MAM1, LUA... VC20019 [LUAm1, ERTm5, ERTm2] VC40171 [LUAm1] Name: pathogenStrain, dtype: object
You can see from our above results that we have generated a Series object, where the pathogen strains associated with each worm strain are stored as a list-like object which turns out to be an np.array. We know that we can extract the .size property from those objects so that should get us our answer!
# Query for only infected data
(embryo_merged_subset.query('pathogenDose > 0')
# Group our data by worm strain
.groupby('wormStrain')
# Isolate pathogen strain information in each group
['pathogenStrain']
# Determine the unique values in each
.unique()
# Extract the size of each array
.size
)
18
Uh oh, just a single number - 18. That's actually how many worm strains we had in the Series object generated by the call to .unique(). What we wanted, instead, was the size of each element in the Series. How do we obtain that from what we have so far?
1.5.1 Use the .apply() method to broadcast a function to individual elements¶
We haven't spent much time discussing this for DataFrames, but you may recall that np.array objects can perform element-wise arithmetic. The general term for this ability is broadcasting. DataFrames and Series support similar element-wise operations.
We can:
- set the value of an entire dataset to a static value
- Use specific values/criteria to identify subsets of cells and modify those (via booleans!)
- apply pre-supplied mathematical functions to elements
- apply() our own custom functions to elements
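The first three abilities above are plain broadcasting; a quick sketch on a toy column of counts:

```python
import pandas as pd

# A toy column of embryo counts
counts = pd.Series([10, 12, 8])

# Arithmetic broadcasts to every element - no loop required
doubled = counts * 2

# Comparisons broadcast too, producing a boolean mask
mask = counts > 9
```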
The apply() method takes the form of apply(func, args=(), **kwargs). We'll talk a little more about some of these parameters in later lectures but for now we have:
- func: the name of the function you want to use
- args: a tuple of any additional arguments needed for func to work
Note that the .apply() method is object-dependent and works differently depending on the pandas object for which it is called. In our case, since we have a Series object, the supplied function is applied to each entry (an array) in our Series, letting us grab each array's .size attribute.
You'll see one more unfamiliar piece of syntax: lambda. This is how we can tell Python we want to make a "quick function" on the spot. We'll cover this more in later lectures as well but simply put, it removes the need for a formal declaration of a function (Lecture 7). For now, we'll just roll with it.
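If it helps, a lambda is just shorthand for a one-line named function. The two forms below are interchangeable (get_size is a name invented for this sketch):

```python
import numpy as np

# A formally declared function...
def get_size(x):
    return x.size

# ...and its lambda equivalent, defined on the spot
get_size_quick = lambda x: x.size

strains = np.array(['LUAm1', 'ERTm5'])
```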
# Query for only infected data
(embryo_merged_subset.query('pathogenDose > 0')
# Group our data by worm strain
.groupby('wormStrain')
# Isolate pathogen strain information in each group
['pathogenStrain']
# Determine the unique values in each
.unique()
# Extract the size of each array
.apply(lambda x: x.size)
)
wormStrain AB1 1 AWR144 2 AWR145 2 CB4856 1 ED3042 1 ED3052A 1 ED3052B 1 JU1400 9 JU300 1 JU360 2 JU397 1 JU642 1 MY1 2 MY2 2 MY6 1 N2 10 VC20019 3 VC40171 1 Name: pathogenStrain, dtype: int64
1.5.2 Use .nunique() on a grouped dataframe to return the number of unique elements¶
Looking at our code above, we went through 5 steps to get a final answer:
- Query the data.
- Group the data.
- Isolate the column we want to analyse for each group.
- Determine the unique values in the data.
- Retrieve the size of each array element.
What if, instead, we used a helpful method - .nunique() - to count the number of unique elements in our grouped DataFrame? This method simplifies the process by combining the unique() and len() steps we have used previously to achieve the same goal. Furthermore, it ignores NA values by default. Using this method condenses our process into just 3 steps of code, as we'll see.
# Query for only infected data
(embryo_merged_subset.query('pathogenDose > 0')
# Group our data by worm strain
.groupby('wormStrain')
# Determine the length of unique values in each
.nunique()
)
| worm.number | date | pathogenStrain | pathogenDose | doseLevel | timepoint | merontsPresent | sporesPresent | numEmbryos | experiment | experimenter | description | Infection Date | Plate Number | Total Worms | Spore Lot | Lot concentration | Total ul spore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wormStrain | ||||||||||||||||||
| AB1 | 60 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 16 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| AWR144 | 50 | 5 | 2 | 2 | 2 | 1 | 2 | 2 | 22 | 6 | 1 | 3 | 5 | 3 | 1 | 2 | 2 | 2 |
| AWR145 | 50 | 5 | 2 | 2 | 2 | 1 | 2 | 2 | 14 | 6 | 1 | 3 | 5 | 3 | 1 | 2 | 2 | 2 |
| CB4856 | 64 | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 27 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| ED3042 | 56 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 10 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| ED3052A | 50 | 3 | 1 | 2 | 2 | 1 | 2 | 1 | 14 | 4 | 1 | 2 | 3 | 2 | 1 | 1 | 1 | 2 |
| ED3052B | 50 | 3 | 1 | 2 | 2 | 1 | 2 | 1 | 13 | 4 | 1 | 2 | 3 | 2 | 1 | 1 | 1 | 2 |
| JU1400 | 115 | 11 | 9 | 10 | 6 | 2 | 2 | 2 | 25 | 41 | 1 | 7 | 9 | 18 | 3 | 3 | 5 | 12 |
| JU300 | 69 | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 17 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| JU360 | 65 | 1 | 2 | 4 | 2 | 1 | 1 | 2 | 23 | 4 | 1 | 1 | 1 | 4 | 1 | 2 | 2 | 4 |
| JU397 | 60 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 17 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| JU642 | 61 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 19 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| MY1 | 62 | 4 | 2 | 5 | 3 | 1 | 2 | 2 | 24 | 12 | 1 | 3 | 4 | 6 | 1 | 2 | 2 | 5 |
| MY2 | 71 | 1 | 2 | 4 | 2 | 1 | 1 | 2 | 34 | 4 | 1 | 1 | 1 | 4 | 1 | 2 | 2 | 4 |
| MY6 | 61 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 22 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 2 |
| N2 | 85 | 11 | 10 | 12 | 6 | 2 | 2 | 2 | 37 | 43 | 1 | 7 | 9 | 21 | 3 | 3 | 6 | 15 |
| VC20019 | 60 | 1 | 3 | 6 | 2 | 1 | 2 | 2 | 20 | 6 | 1 | 1 | 1 | 6 | 1 | 3 | 3 | 6 |
| VC40171 | 50 | 3 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 3 | 1 | 2 | 3 | 1 | 1 | 1 | 1 | 1 |
In 3 commands, we were able to determine not only the number of unique values in pathogenStrain but also the unique count across all of the data columns! What happens if we want to do more than just calculate a single value?
1.5.3 Use the .agg() method to generate a summary from multiple functions¶
Rather than using just a single method like .nunique() you can actually apply multiple functions to a grouped dataframe to generate a summary of the information. In this case we'll return to looking at the pathogenStrain column to simplify our output. We are also a little limited by the kind of summary we can achieve from string data. We can, however, still count() the number of entries in each group.
To combine both of these methods to produce a summary, we'll use the .agg() method, which will accept a list of function names like "count" and "nunique" but also "sum", "min", "max" and other methods found in the GroupBy object. You can also use your own custom function just like the .apply() method.
# Query for only infected data
(embryo_merged_subset.query('pathogenDose > 0')
# Group our data by worm strain
.groupby('wormStrain')
# Isolate pathogen strain information in each group
['pathogenStrain']
# Determine the total number of values in each group, the number of unique, and what they are
.agg([lambda x: x.size, "count", "nunique", "unique"])
)
| <lambda_0> | count | nunique | unique | |
|---|---|---|---|---|
| wormStrain | ||||
| AB1 | 120 | 120 | 1 | [LUAm1] |
| AWR144 | 300 | 300 | 2 | [LUAm1, ERTm5] |
| AWR145 | 300 | 300 | 2 | [LUAm1, ERTm5] |
| CB4856 | 125 | 125 | 1 | [ERTm5] |
| ED3042 | 107 | 107 | 1 | [LUAm1] |
| ED3052A | 183 | 183 | 1 | [LUAm1] |
| ED3052B | 200 | 200 | 1 | [LUAm1] |
| JU1400 | 2109 | 2109 | 9 | [LUAm1, ERTm5, AWRm78, LUAm3, MAM1, LUAm1-HK, ... |
| JU300 | 132 | 132 | 1 | [ERTm5] |
| JU360 | 251 | 251 | 2 | [LUAm1, ERTm2] |
| JU397 | 113 | 113 | 1 | [LUAm1] |
| JU642 | 121 | 121 | 1 | [LUAm1] |
| MY1 | 613 | 613 | 2 | [LUAm1, ERTm5] |
| MY2 | 250 | 250 | 2 | [ERTm5, ERTm2] |
| MY6 | 112 | 112 | 1 | [LUAm1] |
| N2 | 2221 | 2221 | 10 | [LUAm1, ERTm5, ERTm2, AWRm78, LUAm3, MAM1, LUA... |
| VC20019 | 323 | 323 | 3 | [LUAm1, ERTm5, ERTm2] |
| VC40171 | 150 | 150 | 1 | [LUAm1] |
1.5.4 Use different methods for different variables with the .agg() method¶
In our previous example we isolated the pathogenStrain column for the .agg() method but we can actually choose one or more columns to apply specific functions on using a dictionary object where the key:value pairs take on the form of column_name:[func_1, ..., func_n].
Let's modify our above code to gather the mean and standard deviation of numEmbryos for each infected worm strain (regardless of pathogen), while also counting the number of animals measured for each strain.
# Query for only infected data
(embryo_merged_subset.query('pathogenDose > 0')
# Group our data by worm strain
.groupby('wormStrain')
# Determine the total number of values in each group, the number of unique, and what they are
.agg({'worm.number':['count'],
'numEmbryos':['mean', 'std']})
)
| worm.number | numEmbryos | ||
|---|---|---|---|
| count | mean | std | |
| wormStrain | |||
| AB1 | 120 | 6.433333 | 4.154037 |
| AWR144 | 300 | 7.743333 | 4.980644 |
| AWR145 | 300 | 2.196667 | 3.098925 |
| CB4856 | 125 | 14.656000 | 6.054995 |
| ED3042 | 107 | 1.598131 | 2.794401 |
| ED3052A | 183 | 2.568306 | 3.124607 |
| ED3052B | 200 | 2.280000 | 2.695912 |
| JU1400 | 2109 | 4.055477 | 4.829862 |
| JU300 | 132 | 6.651515 | 3.021513 |
| JU360 | 251 | 3.920319 | 5.707681 |
| JU397 | 113 | 9.380531 | 3.480349 |
| JU642 | 121 | 7.446281 | 4.151205 |
| MY1 | 613 | 5.389886 | 6.395696 |
| MY2 | 250 | 9.036000 | 8.560833 |
| MY6 | 112 | 9.294643 | 5.625935 |
| N2 | 2221 | 10.373706 | 6.531484 |
| VC20019 | 323 | 5.170279 | 5.222870 |
| VC40171 | 150 | 0.026667 | 0.230164 |
Notice the above output generates subgroupings within our columns, so each individual column is multi-level in its labeling! This is important when trying to select particular columns. To identify an individual column, describe it as a tuple of the form ('level_1', 'level_2', ..., 'level_n'). For example, from our above output, the std column is actually labeled ('numEmbryos', 'std').
You can confirm this by adding a .columns call to our above code!
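Tuple-based selection looks like this sketch on a toy stand-in for our merged data (hypothetical values):

```python
import pandas as pd

# Toy stand-in for our merged data (hypothetical values)
toy = pd.DataFrame({'wormStrain': ['N2', 'N2', 'AB1'],
                    'numEmbryos': [18, 20, 10]})

summary = toy.groupby('wormStrain').agg({'numEmbryos': ['mean', 'std']})

# Columns are now two-level; select one with a tuple
std_col = summary[('numEmbryos', 'std')]
```

Note that AB1 has only one observation here, so its std entry is NaN, just as in our real output above.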
1.5.5 Use more complex filtering with the .query() method¶
Thus far we have limited our query() filtering to simply choosing between pathogenDose == 0 or pathogenDose > 0. What if we were also interested in looking at only a subset of our data for this analysis? The two most prevalent strains are N2 and JU1400. How can we filter for only their data in our analysis?
Suppose we wanted to know the count, mean, and standard deviation in number of embryos for just N2 and JU1400 animals tested in mock infection replicates - basically retrieving their baseline uninfected states. We can do a slightly more complex filtering with a few more predicates. To accomplish this we'll revisit 2 keywords:
- in: searches for membership of a value within a specified list, ie value in ['query1', 'query2']
- &: joins predicates together so that all conditions must be met before returning a True value (more on this in Lecture 5). Normally, this is known as the "bitwise AND", which operates in a particular way, but for us it will act as a "logical AND". More on this in section 3.3.0.
In our case we want to meet 2 conditions: wormStrain in ['N2', 'JU1400'] AND pathogenDose == 0. Let's see how we can join these into a single query call, BUT look closely at how we use single and double quotation marks in the .query() method!
# Query for only infected data
(embryo_merged_subset.query('wormStrain in ["N2", "JU1400"] & pathogenDose == 0')
# Group our data by worm strain
.groupby('wormStrain')
# Determine the number of observations in each group, and summary stats
.agg({'worm.number':['count'],
'numEmbryos':['mean', 'std']})
)
| worm.number | numEmbryos | ||
|---|---|---|---|
| count | mean | std | |
| wormStrain | |||
| JU1400 | 681 | 11.139501 | 4.442401 |
| N2 | 720 | 18.912500 | 5.118279 |
1.5.6 Find the right tool to save yourself time¶
Now that we've walked through a few analysis directions and introduced a number of great tools, you'll notice a basic pattern to most analyses:
- Filter data with .query().
- Group data with .groupby().
- Summarize data with premade functions, separately or in combination with .agg(), or with custom functions via .apply().
After more practice and experience, some of these functions will become second nature in your coding choices.
Use the code cell below to help generate an answer.
# Comprehension code answer 1.5.6
# Query for only infected data
(embryo_merged_subset.query('pathogenDose > 0')
# Group our data by worm strain
.groupby(by = ['pathogenStrain', 'pathogenDose'])
# Determine the number of unique worm strains tested in each group
.agg(...)
# Sort the values in nunique to simplify your search
.sort_values(...)
)
2.0.0 Plotting with the pyplot module¶
2.0.1 Use exploratory plots to assess your data¶
Now that we've had a chance to look at our data close up, let's talk about how we can use exploratory plots to give us a quick visual assessment of our data. We can use these visualizations to help make decisions about how to further analyse our data. Is there a difference between different groups of data? Does it look like there might be any bias between our datasets? What does the overall distribution of our sampling look like?
2.0.2 Types of plots¶
Often when trying to convey a message about our data through a visualization, we want to choose the right kind of visualization. These visualizations can also be referred to as figures or plots. Within the matplotlib package is the pyplot module, a collection of functions that give matplotlib capabilities very similar to the programming language MATLAB. The pyplot module has functions that can create some of the following basic plots:
| Plot type | Command | What to use it for |
|---|---|---|
| Bar plot | bar() | Population data summaries. Helpful for contrasting between groups |
| Scatter plot | scatter() | Multiple independent measurements across different variables |
| Line plot | plot() | Multiple measurements that represent the same sample(s) |
| Histogram | hist() | Generate a distribution by binning your data |
| Stem or lollipop | stem() | A twist on the bar plot that may be more compact and visually pleasing |
| Box plot | boxplot() | Create a visual summary of your datapoints based on their distribution |
| Violin plot | violinplot() | Create a visual kernel density (distribution) estimate of datapoints |
2.0.3 Plot components¶
Within each plot are a number of basic components: titles, axis properties, legends, etc. Here is a helpful table outlining some of the basic plot components.
| Component | Description | Command | Parameters |
|:-:|:-|:-|:-|
| Title | The title of your plot | title() | |
| X- or Y-axis title | The axis titles of your plot | xlabel(), ylabel() | xlabel=str, loc={'left', 'center', 'right'}, text properties |
| X- or Y-axis ticks | Alter your axis tick positions/locations and labels | xticks(), yticks() | ticks=[a, ..., n], labels=[label1, ..., labeln] |
| Axis limits | A list defining the x- and y-axis limits | axis() | [xMin, xMax, yMin, yMax] |
| Axis scale | Set the kind of axis scale for your data | xscale(), yscale() | "linear", "log", "symlog", "logit" |
| Text properties | Labels can take text parameters too | | color, fontsize, fontstyle, rotation |
2.1.0 Build a basic barplot with the bar() method¶
We'll use our worm embryo values as an example to try and plot some of our data as a bar plot. The bar() method generally requires two sets of data to be supplied along with some optional data:
- x: An array of x-coordinates (group labels, or x values)
- height: A float or array of bar heights - these are usually the measured/summarized values
- width: A float or array of bar widths (default is 0.8)
- bottom: The y-coordinate(s) of the bottom of the bars (default is 0)
- align: The alignment of the bars to your x-coordinate labels (default is center)
Let's start by building a basic barplot examining the baseline mean number of embryos in our various worm strains. First we'll make a DataFrame object holding the relevant values. Then we'll build our barplot and see what needs to be altered as we move forward. Note that we use the plt.show() method to display our plot after putting a lot of pieces together.
# We'll reset the code cells to only show the last code call. This will de-clutter the plotting process for us
InteractiveShell.ast_node_interactivity = "last"
# Determine the mean number of embryos across all non-infected observations per strain
wormStrain_mean_embryos = (embryo_merged_subset.query('pathogenDose == 0')
# Group data by worm strains
.groupby('wormStrain')
# Get the mean value in each group
.agg({'numEmbryos':'mean'})
# Sort the data
.sort_values(by = 'numEmbryos', ascending = False)
)
# Check on the results
wormStrain_mean_embryos
| numEmbryos | |
|---|---|
| wormStrain | |
| MY2 | 21.886179 |
| CB4856 | 20.750000 |
| AWR145 | 20.656000 |
| AWR144 | 19.228000 |
| N2 | 18.912500 |
| MY6 | 18.360656 |
| JU360 | 16.788618 |
| JU300 | 15.622951 |
| VC20019 | 15.143678 |
| JU642 | 14.733333 |
| ED3042 | 13.147541 |
| ED3052B | 12.047297 |
| MY1 | 11.725581 |
| JU397 | 11.457143 |
| JU1400 | 11.139501 |
| AB1 | 10.903226 |
| ED3052A | 10.526667 |
| VC40171 | 4.586667 |
# Build our barplot by giving the index as the x-label, and values as the height
# We'll need to supply the data directly to each parameter
plt.bar(x = wormStrain_mean_embryos.index,
height = wormStrain_mean_embryos['numEmbryos'])
# Show our plot
plt.show()
2.1.1 Use the figure() function to set your figure size¶
As we can see from our first attempt, the plot is rather small. We definitely have some problems with the basics of this plot, and we'll address the first one: the figure could be larger so we can see the x-axis labels better. Use the figure() function to set your figure size with the figsize parameter.
## 2.1.1 Fix the size of the plot
plt.figure(figsize = (12,5))
# Build our barplot
plt.bar(x=wormStrain_mean_embryos.index,
height=wormStrain_mean_embryos['numEmbryos'])
# Show our plot
plt.show()
2.1.2 Rotate your x-axis text with the xticks() function¶
Now that our plot is larger, the x-axis labels are still overcrowded. We can, however, rotate the text and see if that helps. Let's rotate it to a 90-degree angle. We'll alter this axis property through the xticks() function.
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Build our barplot
plt.bar(x=wormStrain_mean_embryos.index,
height=wormStrain_mean_embryos['numEmbryos'])
## 2.1.2 Rotate the x-axis text
plt.xticks(rotation = 90)
# Show our plot
plt.show()
2.1.3 Add labels to your plot¶
Okay, the plot is larger and we've fixed our x-axis label issues. No more overcrowding! Let's add a main title and axis titles to our dataset. We can use the title(), xlabel(), and ylabel() functions in this case. While we set the labels, we can also set their properties such as fontsize, fontstyle, and color.
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Build our barplot
plt.bar(x=wormStrain_mean_embryos.index,
height=wormStrain_mean_embryos['numEmbryos'])
# Rotate the x-axis text
plt.xticks(rotation = 90)
## 2.1.3 Add titles
plt.title("Baseline mean embryos per worm strain", fontsize = "x-large")
# You can play with font style
plt.xlabel("worm strain", fontstyle = "italic")
# And even the font colour!
plt.ylabel("mean embryo count", color = "r")
# Show our plot
plt.show()
2.1.4 Alter properties of your barplot¶
Similar to altering text properties, most plots let you alter various properties like fill and line color, or other plot-specific attributes. Let's update the fill and line colours for our barplot using the color and edgecolor parameters.
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Build our barplot
plt.bar(x=wormStrain_mean_embryos.index,
height=wormStrain_mean_embryos['numEmbryos'],
## 2.1.4 you can also change some barplot aspects
color="orchid",
edgecolor="black")
# Rotate the x-axis text
plt.xticks(rotation = 90)
# Add titles
plt.title("Baseline mean embryos per worm strain", fontsize = "x-large")
plt.xlabel("worm strain", fontstyle = "italic")
plt.ylabel("mean embryo count", color = "r")
# Show our plot
plt.show()
?plt.axis
# Comprehension answer code 2.0.0
# Fix the size of the plot
plt.figure(figsize = (12, 5))
# Build our barplot
plt.bar(x=wormStrain_mean_embryos.index,
height=wormStrain_mean_embryos['numEmbryos'],
color="orchid",
edgecolor="black")
# Rotate the x-axis text
plt.xticks(rotation = 90)
# Add titles
plt.title("Baseline mean embryos per worm strain", fontsize = "x-large")
plt.xlabel("worm strain", fontstyle = "italic")
plt.ylabel("mean embryo count", color = "r")
# Alter the axis limits
plt.axis(...)
# Show our plot
plt.show()
3.0.0 Basic visualizations with the seaborn package¶
Building upon our visualizations in the last section, there are some common themes you might recognize about them. We have a plot area, x- and y-axis data, axis limits, and plot colors. Using matplotlib to help generate your visualizations, you can control many small details but it can also be tedious at times to micromanage so many aspects of your plot.
The seaborn package is actually built upon the pyplot module and tries to bring a high-level approach to statistical plots. As we'll see later on, this means updating certain details of our plots will require an understanding of the base matplotlib and pyplot functions.
3.0.1 The seaborn package subdivides plot types into 3 categories¶
The seaborn package takes a dual-pronged approach to generating plots. There are functions considered to work at the Figure level and then there are functions that affect what is known as the Axes level. To simplify the concept:
- Axes: a single plot defined by an x- and y-axis grid. This includes all of the basic plots like scatter and box plots.
- Figure: a plot space that can contain anywhere from one to multiple Axes. The arrangement of Axes can range from simple to complex.
At the Axes level, there are 3 categories of plot types based on their similarity: relational, distribution, and categorical. For each of these categories there is a figure-level function that can be used to create multi-panel (faceted) versions of these plots by splitting the data further based on categorical variables.
[Image: overview of seaborn's figure-level and axes-level plotting functions]
From the seaborn overview: for most simple plots, one of the above figure- or axes-level plots can be utilized.
The above functions are used to initialize figure and axes objects by identifying a number of properties. Within these, some of the options can vary greatly based on plot type. Of the two levels of functions, their influence on figure attributes can vary:
- Figure-level: return a FacetGrid object that has some additional methods for altering attributes of the plot in a way that makes sense to the subplot organization.
- Axes-level: add axis labels and legends to the Axes they are drawn onto but do not alter the figure in any other way. You can choose to draw onto the current axes in memory OR specify a reference to an axes which may be within a larger figure.
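To make the two levels concrete, here is a minimal sketch (using a tiny made-up DataFrame, not our course data) showing that an axes-level function returns a matplotlib Axes while a figure-level function returns a seaborn FacetGrid:

```python
import matplotlib
matplotlib.use("Agg")  # Render off-screen so no display window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# A tiny stand-in dataset (hypothetical values, just for illustration)
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 4.1, 5.9, 8.2]})

# Axes-level: draws onto a matplotlib Axes (current or given via ax=) and returns it
ax = sns.scatterplot(data=df, x="x", y="y")
print(type(ax))

# Figure-level: creates its own figure and returns a seaborn FacetGrid
g = sns.relplot(data=df, x="x", y="y", kind="scatter")
print(type(g))
plt.close("all")
```

This distinction matters later: methods like .set() or .set_axis_labels() belong to the FacetGrid, not to a plain Axes.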
3.1.0 Introduction to the Grammar of Graphics¶
One approach to effective data visualization relies on the Grammar of Graphics framework originally proposed by Leland Wilkinson (2005). The idea of grammar can be summarized as follows:
- Grammar is the foundational set of rules that define the components of a language.
- A language is built on a structure that consists of syntax and semantics.

The grammar of graphics is a language for communicating about what we are plotting programmatically.
It begins with a tidy data frame. It will have a series of observations (rows) each of which will be described across multiple variables (columns). Variables can actually represent qualitative or quantitative measurements or they could be descriptive data about the experiments or experimental groups.
The data units may undergo conversion through a process called scaling (transformation) before being used for plotting.
A subset of data columns are then passed on to be presented in various data plots (scatterplots, boxplots, kernel density estimates, etc.) by using the data to describe visual properties of the plot. We call these visual properties, the aesthetics of the plot. For example, the data being plotted or represented can be visually altered in shape or colour based on accompanying column data.
A plot can have multiple layers (for example, a scatter plot with a regression line) and each of these plot types is referred to as a geom (short for geometric object).
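As a minimal sketch of the tidy-data idea (using a small made-up table, not our course data): each row is an observation and each column is a variable, and wide-format measurements can be reshaped into this tidy form with the pandas .melt() method before columns are mapped onto aesthetics.

```python
import pandas as pd

# A small made-up "wide" table: one row per worm, one column per timepoint
wide = pd.DataFrame({
    "worm": ["w1", "w2"],
    "t0":   [10, 12],
    "t24":  [18, 21],
})

# Reshape to tidy (long) form: one observation per row
tidy = wide.melt(id_vars="worm", var_name="timepoint", value_name="count")
print(tidy)
# rows: (w1, t0, 10), (w2, t0, 12), (w1, t24, 18), (w2, t24, 21)
```

In grammar-of-graphics terms, worm or timepoint could then be mapped to an aesthetic like colour, count to the y-axis, and a scatter or box geom chosen to display them.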
3.1.1 The grammar of graphics with seaborn¶
The grammar of graphics facilitates a concise description of the components of any graphic. Hadley Wickham of R tidyverse fame has proposed a variant on this concept - the layered grammar of graphics framework in the ggplot2 package for R. By following a layered approach of defined components, it becomes easy to build a visualization.
In a similar manner, the seaborn package has some methods that facilitate a layering approach to building your visualizations. However, many of the details are built upon the foundation of layering Axes objects or alterations upon Figure objects.
Each Axes-level function usually takes in:
- Data: your visualization always starts here. What are the dimensions you want to visualize? What aspect of your data are you trying to convey?
- Aesthetics: assign your axes based on the data dimensions you have chosen. Where will the majority of the data fall on your plot? Are there other dimensions (such as categorically encoded groupings) that can be conveyed by aspects like size, shape, colour, fill, etc.?
- Geometric objects: how will you display your data within your visualization? Which *plot function will you use?
The figure-level methods can be used to alter or update:
- Scale: do you need to alter your x- or y-axis limits? What about scaling/transforming any values to fit your data within a range? Sometimes, depending on the geometric object, you are better off transforming your data ahead of time.
- Facets: will generating subplots of the data add a dimension to your visualization that would otherwise be lost or hard to discern?
- Coordinate system: will your visualization follow a classic cartesian, semi-log, polar, etc. coordinate system?
Let's jump into our first dataset and start building some plots with it shall we?
3.1.2 Import infection_signal.tsv¶
Before we dig into the seaborn package, we will import a new dataset that has a slightly more diverse array of data that we can use for showcasing the plotting power of seaborn. To accomplish our task, we'll begin by importing an alternative dataset of measurements from infection_signal.tsv which consists of area measurements (total pixels) of animals from images after pathogen infection.
# We'll reset the code cells to only show FINAL code output.
InteractiveShell.ast_node_interactivity = "last"
# Read the infection signal data in from file
infectionSig_data = pd.read_csv("data/infection_signal.tsv", sep = '\t')
# Look at the first 5 rows of data
infectionSig_data.head(5)
| exp.name | strain | spore.strain | spore.species | dose | spores | fixing.date | slide | file | worm.number | area | percent.infected | area.infected | timepoint | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | N2-LUAm1-1.8 | N2 | LUAm1 | N.ferruginous | pulse-72H | 1.8 | rep1 | 1 | N2.LUAm1.rep1 | 1 | 49838.02 | 18.53 | 9234.985106 | 72hpi |
| 1 | N2-LUAm1-1.8 | N2 | LUAm1 | N.ferruginous | pulse-72H | 1.8 | rep1 | 1 | N2.LUAm1.rep1 | 2 | 50425.04 | 0.00 | 0.000000 | 72hpi |
| 2 | N2-LUAm1-1.8 | N2 | LUAm1 | N.ferruginous | pulse-72H | 1.8 | rep1 | 1 | N2.LUAm1.rep1 | 3 | 45532.67 | 31.16 | 14187.979970 | 72hpi |
| 3 | N2-LUAm1-1.8 | N2 | LUAm1 | N.ferruginous | pulse-72H | 1.8 | rep1 | 1 | N2.LUAm1.rep1 | 4 | 46458.55 | 3.88 | 1802.591740 | 72hpi |
| 4 | N2-LUAm1-1.8 | N2 | LUAm1 | N.ferruginous | pulse-72H | 1.8 | rep1 | 1 | N2.LUAm1.rep1 | 5 | 49214.73 | 0.00 | 0.000000 | 72hpi |
# How big is this dataframe?
infectionSig_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 456 entries, 0 to 455
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   exp.name          456 non-null    object
 1   strain            456 non-null    object
 2   spore.strain      456 non-null    object
 3   spore.species     456 non-null    object
 4   dose              456 non-null    object
 5   spores            456 non-null    float64
 6   fixing.date       456 non-null    object
 7   slide             456 non-null    int64
 8   file              456 non-null    object
 9   worm.number       456 non-null    int64
 10  area              456 non-null    float64
 11  percent.infected  456 non-null    float64
 12  area.infected     456 non-null    float64
 13  timepoint         456 non-null    object
dtypes: float64(4), int64(2), object(8)
memory usage: 50.0+ KB
# What are the unique values in each column?
infectionSig_data.apply(pd.unique, axis = 0)
exp.name            [N2-LUAm1-1.8, JU1400-LUAm1-1.8, AWR144-LUAm1-...
strain              [N2, JU1400, AWR144, AWR145]
spore.strain        [LUAm1]
spore.species       [N.ferruginous]
dose                [pulse-72H]
spores              [1.8]
fixing.date         [rep1, rep2, rep3]
slide               [1]
file                [N2.LUAm1.rep1, JU1400.LUAm1.rep1, AWR144.LUAm...
worm.number         [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
area                [49838.02, 50425.04, 45532.67, 46458.55, 49214...
percent.infected    [18.53, 0.0, 31.16, 3.88, 23.54, 1.65, 8.6, 18...
area.infected       [9234.985106, 0.0, 14187.97997, 1802.59174, 93...
timepoint           [72hpi]
dtype: object
Taking a quick look at our data, we can briefly summarize it here: there are a few categories we can explore, like strain (4 types) and fixing.date (3 replicates), for measured variables like area (total area of a worm), area.infected (infection signal area), and percent.infected (infected area as a % of the total).
3.2.0 Build a scatterplot using the relplot() function¶
We'll start by building a basic scatterplot. We'll focus on comparing the total worm area versus the infected area. Rather than working at the axes-level, we'll work with the encompassing relplot() method, which will give us flexibility in our visual exploration down the road.
For the basic plot relplot(), we'll start with the following parameters:
- data: the tidy (long-form) data set we want to visualize.
- x, y: the variable names for assigning x- and y-axis values.
- height: the total height of our figure.
- aspect: a scalar value used to determine the width of your figure (width = height * aspect).
- kind: the type of plot we want to produce: scatter (default) vs line.
- kwargs: a catch-all for any other keyword arguments that can be passed on to underlying functions like the Axes-level methods.
# Import your seaborn package
import seaborn as sns
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data, # Set the data
height = 6, aspect = 1, # Set the size of the figure
kind = "scatter" # Set the figure type
)
# Show the plot (avoid extra object information)
plt.show()
Looking at the output we can see that without having specified the x and y axis values, it simply plotted all of the variables along a default x-axis of index number (456 rows total). Many of these values have no real relationship to each other at all. Let's try again by setting our axis variables.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected", # Set the axis variables
height = 6, aspect = 1,
kind = "scatter"
)
# Show the plot (avoid extra object information)
plt.show()
3.2.1 Specify colouring by category with the hue parameter¶
Now we begin to see the power of having a tidy DataFrame. Since each of our observations is in its own row, we can classify each observation by factors like strain! Using the seaborn package, we can specify the colour of our points using the hue parameter. Since we have set the data = infectionSig_data parameter, we can tell seaborn to look at a specific column when determining the hue parameter.
At the same time, we'll set the alpha parameter which is essentially the opacity of each datapoint. Setting a lower value increases transparency which allows us to see overlapping datapoints better. This parameter becomes especially helpful when working with extremely dense datapoints.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
height = 6, aspect = 1,
kind = "scatter"
)
# Show the plot (avoid extra object information)
plt.show()
3.2.2 Alter your axis scale with the .set() method¶
The .set() method is a gateway to altering a number of aspects of your plot. Once we have our plot saved as an object named snsPlot, we can alter or set some of its properties this way. In particular, we will be using the yscale parameter to change the y-axis to a log scale.
Note that our plot object is actually a seaborn FacetGrid built on top of matplotlib, and that's where we are calling the .set() method from. We can actually set quite a few figure attributes through this method.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
height = 6, aspect = 1,
kind = "scatter"
)
## 3.2.2 Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot (avoid extra object information)
plt.show()
3.2.3 Facet your data into different plots with the relplot() function¶
If we want to split our data into multiple plots based on certain variables, this is known as faceting your data. Usually this results in a grid-like pattern where data is grouped by categories of one variable as columns and another variable as rows, although the data could also be split on a single variable instead. Either way, this generates a figure-level object known as a FacetGrid.
The relplot() method already has the capability to handle this splitting of axes within the figure it generates. The relplot() method can facet data across two variables using the row and col parameters. We simply need to name the variable(s) that will be used to categorize the data.
To summarize, the parameters to use for this operation are:
col: The variable name that will be used to group the columns of your grid.row: The variable name that will be used to group the rows of your grid.
Below, we'll remove the colouring of points based on strain and instead, split the data into two Axes based on this information.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
alpha = 0.6, # Set the transparency of the points
height = 6, aspect = 0.6, # Set the size of the figure
kind = "scatter",
col = "strain" ## 3.2.3 Split the columns of the grid by strain
)
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot (avoid extra object information)
plt.show()
3.2.4 Use a continuous variable to colour your data¶
Note that we typically only need to apply one attribute to each dimension of data we are investigating. By splitting the data by strain we no longer need to colour it based on this category. We can, however, add additional information to our visualization by using another dimension in our data. Instead of colouring the points based on a categorical variable (strain), we can use a continuous variable like percent.infected from our dataset to see if there could be a trend in relation to our data.
It is easy enough to set this dimension using the hue parameter in our initial relplot() call. We'll also set the palette parameter to a different colour scheme, and set the edgecolor parameter so that our lower-value/white points can still be seen.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
hue = "percent.infected", palette = "Reds", ## 3.2.4 Set the point-colour by percent infected
edgecolor = "black", ## 3.2.4 Set the point border colour so we can see them all
alpha = 0.6,
height = 6, aspect = 0.6,
kind = "scatter",
col = "strain" # Split the columns of the grid by strain
)
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot (avoid extra object information)
plt.show()
By colouring our datapoints using percent.infected, we can now see more clearly on the same plot that samples with higher area.infected values but lower overall area tend to have a higher overall infected-area percentage. Not a surprising result BUT very cool to see when colouring our datapoints. Rather than generating additional plots comparing different pairs of variables, we've simply added an additional dimension of information to our visualization.
3.2.5 Update your axis using the .set_axis_labels() method¶
The names of our axis titles are drawn from the variable names we used for the original DataFrame but we may be limited in how those variables are originally named. In other cases you may wish to add units, or simply make your axis titles more descriptive. To accomplish this we can alter our labels directly using the set_axis_labels() method. The parameters to set are x_var and y_var in that order. Set them directly if you only want to change a single axis title.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
hue = "percent.infected", palette = "Reds", # Set the point-colour by percent infected
edgecolor = "black", # Set the point border colour so we can see them all
alpha = 0.6,
height = 6, aspect = 0.6,
kind = "scatter",
col = "strain" # Split the columns of the grid by strain
)
## 3.2.5 Set the axis titles (Aesthetics)
snsPlot.set_axis_labels(x_var = "Total area", y_var = "Area infected")
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot (avoid extra object information)
plt.show()
3.2.6 Set the shape of your points with the style parameter¶
We'll switch gears a little at this point and ask what our data looks like when we compare across our fixing.date replicates. For each strain facet, we'll separate the data by replicate by changing point shapes using the style parameter. Thus the visualization of our data will be grouped further into replicate values while still faceting on worm strain.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
hue = "percent.infected", palette = "Reds", # Set the point-colour by percent infected
edgecolor = "black", # Set the point border colour so we can see them all
style = "fixing.date", # 3.2.6 Change the style of our points based on variable values
alpha = 0.6,
height = 6, aspect = 0.5,
kind = "scatter",
col = "strain" # Split the columns of the grid by strain
)
# Set the axis titles (Aesthetics)
snsPlot.set_axis_labels(x_var = "Total area", y_var = "Area infected")
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot (avoid extra object information)
plt.show()
3.2.7 Using the right colours for your visualization with the color_palette() function¶
From the output of our last visualization, you may wonder if we can do better with the colouring based on the percent.infected values. As our values approach 0, the points get very hard to see, which prompted us to use a dark outline for our points. Another helpful consideration is the use of an appropriate colour palette. Palettes can be accessed via the seaborn.color_palette() function, which includes the parameters:
- palette: the name of the palette you wish to access (add "_r" to the name to reverse it!)
- n_colors: the number of colours from the palette you'd like to access (most named palettes have 6 different colours)
- as_cmap: a boolean to determine if you'd like to create a continuous mapping of values across a gradient of palette colours (think heatmaps!)
When calling on this function, it will return to you a series of colour values either in a discrete or continuous gradient.
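To see what this looks like in isolation, here is a quick sketch of the two kinds of return values (run separately from our figure code):

```python
import seaborn as sns

# A discrete palette: a list of (r, g, b) tuples with values between 0 and 1
pal = sns.color_palette("viridis", n_colors=4)
print(len(pal))   # the 4 colours we asked for
print(pal[0])     # the first RGB tuple

# The reversed continuous version, suitable for mapping numeric data via hue
cmap = sns.color_palette("viridis_r", as_cmap=True)
print(type(cmap))
```

The discrete form is handy for categorical hues; the as_cmap form is what we pass to palette when hue is a continuous variable.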
We'll update our figure with the colour-blind friendly "viridis" palette, reversed ("viridis_r") so our higher values are darker in colour.
# We'll build a scatterplot
snsPlot = sns.relplot(data = infectionSig_data,
x = "area", y = "area.infected",
hue = "percent.infected", # Set the point-colour by percent infected
palette = sns.color_palette(palette = "viridis_r",
as_cmap = True), ## 3.2.7 Use a better colour palette like viridis
edgecolor = "black", # Set the point border colour so we can see them all
style = "fixing.date", # 3.2.6 Change the style of our points based on variable values
alpha = 0.7,
height = 6, aspect = 0.5,
kind = "scatter",
col = "strain" # Split the columns of the grid by strain
)
# Set the axis titles (Aesthetics)
snsPlot.set_axis_labels(x_var = "Total area", y_var = "Area infected")
# Change your y-axis to a log scale
snsPlot.set(yscale="log")
# Show the plot (avoid extra object information)
plt.show()
3.3.0 Plotting theoretical distributions¶
Now that we have some of the basics, it's time to take a closer look at using other types of plots. Let's return to our embryo data in embryo_merged_subset. It has a lot of nice population-based data that we can dissect to look at theoretical distributions.
3.3.1 More on filtering your data with the boolean AND operator: &¶
We'll focus our dataset first by filtering for just the N2 strain in the mock infection condition. To accomplish this we'll use the conditional AND operator &, which can combine our boolean expressions. What we skipped last time around is that you can also use the conditional OR operator |, and we've already seen the logical NOT ~, which converts a boolean to its opposite.
If we are generating this code using .loc conditional selection, each boolean expression must be separately enclosed in parentheses ( ). We'll talk more about this next time in Lecture 05. For now, we'll continue to use the .query() method as before and save the filtered data into the object N2_mock_data.
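Before applying this to our real data, here is a small self-contained sketch of all three boolean operators on a toy DataFrame (made-up values, not our embryo data):

```python
import pandas as pd

# Toy data standing in for our real table
df = pd.DataFrame({
    "wormStrain":   ["N2", "N2", "AB1", "AB1"],
    "pathogenDose": [0.0, 1.8, 0.0, 1.8],
})

# AND: both conditions must hold
n2_mock = df.query('wormStrain == "N2" & pathogenDose == 0')

# OR with .loc: either condition may hold; each expression needs its own parentheses
either = df.loc[(df["wormStrain"] == "AB1") | (df["pathogenDose"] == 0), :]

# NOT: flip a boolean mask to its opposite
not_n2 = df.loc[~(df["wormStrain"] == "N2"), :]

print(len(n2_mock), len(either), len(not_n2))  # 1 3 2
```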
# Filter the data by N2 animals with a pathogenDose of 0
N2_mock_data = embryo_merged_subset.query('wormStrain == "N2" & pathogenDose == 0')
# Filter code for using loc instead:
# N2_mock_data = embryo_merged_subset.loc[(embryo_merged_subset['wormStrain'] == "N2") &
# (embryo_merged_subset['pathogenDose'] == 0),
# :]
# Check on the data created
N2_mock_data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 720 entries, 503 to 10748
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   worm.number        720 non-null    int64
 1   date               720 non-null    int64
 2   wormStrain         720 non-null    object
 3   pathogenStrain     720 non-null    object
 4   pathogenDose       720 non-null    float64
 5   doseLevel          720 non-null    object
 6   timepoint          720 non-null    object
 7   merontsPresent     720 non-null    bool
 8   sporesPresent      720 non-null    bool
 9   numEmbryos         720 non-null    int64
 10  experiment         720 non-null    object
 11  experimenter       720 non-null    object
 12  description        720 non-null    object
 13  Infection Date     720 non-null    int64
 14  Plate Number       720 non-null    int64
 15  Total Worms        720 non-null    int64
 16  Spore Lot          720 non-null    object
 17  Lot concentration  720 non-null    int64
 18  Total ul spore     720 non-null    float64
dtypes: bool(2), float64(2), int64(7), object(8)
memory usage: 102.7+ KB
3.3.2 Look at your distribution as a kernel density estimate with kdeplot() or displot()¶
There are a lot of datapoints in our new dataset. This time our measurements are the numEmbryos counts for individual uninfected worms within different replicate experiments. A question we already investigated was the overall distribution of embryo numbers across the different replicates for all of our strains. We can, however, also visualize this data directly as a distribution.
A quick way to answer this question is by generating a kernel density estimate (KDE) using the displot() method from seaborn. We'll need to provide an x value (numEmbryos in this case) and we'll colour our plots based on the date variable which is coded as a series of integers which should represent our separate experimental replicates.
# We'll build a KDE plot
snsPlot = sns.displot(data = N2_mock_data,
x = "numEmbryos",
hue = "date",
height = 6, aspect = 2,
kind = "kde",
fill = True, # The fill parameter is passed on to kdeplot()
)
# Show the plot (avoid extra object information)
plt.show()
Why do we only have 2 colours in our distribution plot? We can see all of the various dates, BUT only the first is really coloured and the others all have a darker hue. It is very reminiscent of when we coloured our scatterplot by the hue parameter as well. However, should the values in date be considered as numbers or as separate groups?
3.3.3 The category data type reassigns order or meaning to your values¶
Sometimes when we work with our data, we may produce what look like numerical values for a variable, like a replicate number or serial number. However, these values aren't really numbers but grouping values, just like we have in our wormStrain variable. Python/pandas doesn't differentiate on this idea inherently when passing data around because it has no insight into our intentions. Likewise, the seaborn package will treat integers and floats as numbers, not as groups.
Recall in section 3.3.1 we looked at the dtype values in our DataFrame and the date variable was an int64. What we want is some other way to represent this data. We could convert it to a string str BUT we'll introduce a better dtype called the category.
Categorical variables are very handy when working with statistical analysis and help to define groups but can also give them an order of importance. This means when analysing or plotting the data, this specific order can be used to determine how that data is used or displayed.
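As a minimal sketch of the idea on a made-up series (not our date column): a plain astype("category") conversion defaults to sorted categories, while pd.Categorical lets you impose your own order.

```python
import pandas as pd

# Replicate labels that look numeric but are really group labels
s = pd.Series([3, 1, 2, 1, 3])

# Plain conversion: categories default to sorted order
cat = s.astype("category")
print(cat.cat.categories.tolist())   # [1, 2, 3]

# An explicit, ordered category definition (e.g. reversed)
ordered = pd.Categorical(s, categories=[3, 2, 1], ordered=True)
print(ordered.categories.tolist())   # [3, 2, 1]
```

This explicit ordering is what lets downstream tools like seaborn display groups in a meaningful sequence rather than numeric order.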
To start off simple, let's convert our date variable to a category data type and see how that affects our KDE plot.
# Convert our date variable with the astype() method
N2_mock_data = N2_mock_data.astype({'date':'category'})
# rebuild a KDE plot
snsPlot = sns.displot(data = N2_mock_data,
x = "numEmbryos",
hue = "date",
height = 6, aspect = 2,
kind = "kde",
fill = True, # The fill parameter is passed on to kdeplot()
)
# Show the plot (avoid extra object information)
plt.show()
It worked! Now we can see that each date has been given a distinct colour!
3.3.3.1 Reverse the order of a list using slicing notation¶
So right now we can see the order of our data has been set by the date values with our earliest date (190426) on top and our latest date (200918) on the bottom. When defining categorical data, you can also define the order of your data. To help us out, we'll grab an array of the unique elements in our date variable and then we can use the slicing notation [::-1] to reverse our array.
# Take the unique values and flip the order
date_list = N2_mock_data.date.unique()[::-1].tolist()
# View the reversed date list
date_list
[200916, 200905, 200918, 200915, 200904, 200825, 200821, 200721, 200714, 200707, 190426]
3.3.3.2 Reorder your categories with reorder_categories()¶
Now that we have a reverse list of our categories (or you could make a custom list of course), you can replace the date column in our dataset. To alter the categorical information, you must access the .cat property and use the .reorder_categories() method on it. This will return a new categorical object which you must use to replace the original date data.
We'll take a look at our updated KDE plot afterwards and see if it worked!
# We'll need to pull out and replace the date column
# In this case we will not "pop" the column out but work on it as part of the DataFrame
N2_mock_data['date'] = (N2_mock_data['date']
# access the categorical property
.cat
# Reorder the categorical data
.reorder_categories(new_categories = date_list)
)
# Double-check the date column results
N2_mock_data['date']
503 190426
504 190426
505 190426
506 190426
507 190426
...
10744 200916
10745 200916
10746 200916
10747 200916
10748 200916
Name: date, Length: 720, dtype: category
Categories (11, int64): [200916, 200905, 200918, 200915, ..., 200721, 200714, 200707, 190426]
# Rebuild a KDE plot with the new category order
snsPlot = sns.displot(data = N2_mock_data,
x = "numEmbryos",
hue = "date", # Now the date variable is a category!
height = 6, aspect = 2,
kind = "kde",
fill = True, # The fill parameter is passed on to kdeplot()
)
# Show the plot (avoid extra object information)
plt.show()
3.3.4 Add a rugplot() to the margin¶
Within seaborn there are a few ways marginal plots can be added to your visualizations. Marginal plots usually add distribution summaries like a histogram, KDE, or, in our case, a rugplot. More specifically, we'll be using the rugplot() method, but there is also the ability to create certain plot combinations using the jointplot() method.
A rugplot is simply a series of vertical or horizontal tick-marks representing our actual data points along the x- and/or y-axis. For our rugplot, we'll add it outside of our plot area by manipulating the parameters:
- height: determines how "tall" our rug plot is as a proportion of the plot. Positive values fall within the axis; negative values fall outside it.
- clip_on: boolean parameter to denote if the plot should cut off objects falling outside the axes limits (True) or let them remain rendered outside the plot axes (False).
- alpha: a common parameter which determines the opacity of plot points. In this case, it will make viewing the density of our tick marks a little better.
Despite where it is plotted, we are adding this plot on the underlying Axes object of the current snsPlot figure.
# We'll build a KDE plot
snsPlot = sns.displot(data = N2_mock_data,
x = "numEmbryos",
hue = "date", # Now the date variable is a category!
height = 6, aspect = 2,
kind = "kde",
fill = True, # The fill parameter is passed on to kdeplot()
)
## 3.3.4 Add a rugplot - this is plotted on top of the current axes object
sns.rugplot(data = N2_mock_data,
x = "numEmbryos",
height = -0.02, # Draw the rug plot BELOW the x-axis
clip_on = False, # Ensure that it is rendered
alpha = 0.5
)
# Show the plot (avoid extra object information)
plt.show()
3.4.0 Boxplots provide visual summary statistics of your data¶
Boxplots are categorical type plots and a great way to visualize summary statistics for your data. As a reminder, the thick line in the center of the box is the median. The upper and lower ends of the box are the first and third quartiles (or 25th and 75th percentiles) of your data. The whiskers extend to the largest value no further than 1.5*IQR (inter-quartile range - the distance between the first and third quartiles).
Data beyond these whiskers are considered outliers and plotted as individual points. This is a quick way to see how comparable your samples or variables are.
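The summary quantities a boxplot draws can be computed by hand. Here is a sketch on a small made-up sample (not our embryo data), showing the quartiles, the IQR, the 1.5*IQR whisker limits, and which points would be flagged as outliers:

```python
import numpy as np

# Made-up measurements standing in for one group's embryo counts
values = np.array([4, 8, 9, 10, 11, 12, 30], dtype=float)

# The quantities a boxplot draws
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                          # inter-quartile range
upper_limit = q3 + 1.5 * iqr           # whiskers extend no further than these
lower_limit = q1 - 1.5 * iqr

# Points beyond the whisker limits are drawn individually as outliers
outliers = values[(values > upper_limit) | (values < lower_limit)]
print(median, iqr, outliers)
```

For this sample the extreme value 30 falls past the upper whisker limit, so a boxplot would draw it as an individual outlier point.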

We are going to use boxplots to see the distribution of embryos per worm strain across all uninfected worm samples.
3.4.1 Generate a basic boxplot with the catplot() function¶
To build the basic boxplot we begin with the main variables. We want to summarize the distribution of mean embryo values for each worm strain across all replicate experiments. We can use the catplot() method to build our visualization. You'll note that the plot is not automatically coloured to differentiate between the x-axis groups.
The catplot() method is the gateway to more categorical plots and behaves similarly to the other two figure-level plots we've encountered.
Before we begin, we'll summarize the data again as we did earlier in section 1.3.0, but this time we'll also include an extra level to group by: doseLevel.
We'll again use the in keyword in our query() to help us filter and use the .agg() method to calculate the mean for each of our subgroups.
# filter by doseLevel
mean_embryo_data = (embryo_merged_subset.query('doseLevel in ["Mock", "Medium"]')
# Group by infection experiment
.groupby(by = ['date', 'wormStrain', 'pathogenStrain', 'doseLevel'])
# Create the frequency table on numEmbryos
# ['numEmbryos']
.agg({'numEmbryos':'mean'})
# Recall, when we reset the index, it converts the indices back into columns
.reset_index()
# Set our wormStrain and doseLevel variables as categories!
.astype({'wormStrain':'category', 'doseLevel':'category'})
)
# Check out our summarized data
mean_embryo_data
| date | wormStrain | pathogenStrain | doseLevel | numEmbryos | |
|---|---|---|---|---|---|
| 0 | 190426 | AB1 | LUAm1 | Medium | 8.583333 |
| 1 | 190426 | AB1 | LUAm1 | Mock | 10.903226 |
| 2 | 190426 | CB4856 | ERTm5 | Medium | 17.484375 |
| 3 | 190426 | CB4856 | ERTm5 | Mock | 20.750000 |
| 4 | 190426 | ED3042 | LUAm1 | Medium | 2.352941 |
| ... | ... | ... | ... | ... | ... |
| 138 | 200916 | N2 | ERTm5-96H | Mock | 20.140000 |
| 139 | 200918 | AWR144 | ERTm5 | Mock | 20.060000 |
| 140 | 200918 | AWR145 | ERTm5 | Mock | 21.760000 |
| 141 | 200918 | JU1400 | ERTm5 | Mock | 14.140000 |
| 142 | 200918 | N2 | ERTm5 | Mock | 21.360000 |
143 rows × 5 columns
Now we can build our boxplot using the summarized data!
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 6, aspect = 2 # Set the height of our plot to 6 and the width to 12 (6x2)
)
# Show the plot (avoid extra object information)
plt.show()
3.4.2 Rotate your x-axis through the set_xticklabels() method¶
We've encountered this problem before when working with matplotlib.pyplot and the solution is actually the same. Recall that seaborn is built upon the back of the pyplot module so we can actually modify the plot directly through pyplot!
We'll rotate the x-axis text to 90 degrees with the set_xticklabels() method on our plot object. At the same time, let's update the hue parameter to colour our boxes by wormStrain.
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 6, aspect = 2,
hue = "wormStrain", # Colour our wormStrains
)
## 3.4.2 Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Show the plot (avoid extra object information)
plt.show()
3.4.4 Generate nested boxplots using the hue parameter¶
We already know that our data is measured across two dose levels (Mock and Medium) so we can take advantage of this information to create a nested (paired/grouped) set of boxplots. When you have a smaller number of categories, this allows you to more directly compare the characteristics of your two populations. Let's see what happens when we use the hue parameter to distinguish between our doseLevel subsets.
We'll also set the boxplot() parameter width to put a little more distance between each category along the x-axis. This value usually ranges between 0 and 1.
To save on some space we'll additionally move the legend into the plot using the legend_out boolean parameter.
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 6, aspect = 2,
hue = "doseLevel", ## 3.4.4 Set the hue by doseLevel
legend_out = False, ## 3.4.4 Move the legend inside the plot
width = 0.6 ## 3.4.4 Put some more distance between categories by decreasing width
)
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Show the plot (avoid extra object information)
plt.show()
Let's take a moment to fix our doseLevel category so that we plot the "Mock" data before the infected data for our boxplot. Remember, we can use the .reorder_categories() method to accomplish this.
# Pull out our doseLevel data and replace it with a newly categorized version
mean_embryo_data['doseLevel'] = (mean_embryo_data['doseLevel']
.cat
.reorder_categories(new_categories = ["Mock", "Medium"])
)
# Double check it worked
mean_embryo_data['doseLevel'].cat.categories
Index(['Mock', 'Medium'], dtype='object')
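As a tiny self-contained illustration (with hypothetical values), the stored category order is what drives the plotting order:

```python
import pandas as pd

# Categories default to sorted order when converting strings to a categorical
s = pd.Series(["Medium", "Mock", "Medium"], dtype="category")
print(list(s.cat.categories))  # ['Medium', 'Mock'] - alphabetical

# Reorder so "Mock" comes (and is plotted) first
s = s.cat.reorder_categories(new_categories=["Mock", "Medium"])
print(list(s.cat.categories))  # ['Mock', 'Medium']
```

Note that reordering only changes the category order, not the underlying values.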
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 6, aspect = 2,
hue = "doseLevel", # Set the hue by doseLevel
legend_out = False, # Move the legend inside the plot
width = 0.6 # Put some more distance between categories by decreasing width
)
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Show the plot (avoid extra object information)
plt.show()
3.4.5 Facet your crowded data with the row and col parameters¶
So our plot above is quite busy. Previously, we used the relplot() method to generate a faceted scatterplot (or relational plot); here we'll use the catplot() method to accomplish something similar. The catplot() method handles the distribution or faceting of categorical plots using similar parameters:
- col: The variable name that will be used to group the columns of your grid.
- row: The variable name that will be used to group the rows of your grid.
This time around we'll facet our data into a stacked set of plots using the row parameter. You'll notice that since we are no longer grouping by hue, the legend will also disappear.
At the same time, we'll play with a few additional methods:
- set_axis_labels(): can accept an x and y-axis label in the form of a string. You can also set separate labels with the set_xlabels and set_ylabels methods.
- set_xticklabels(): we already used this to rotate our x-axis text but we can also set the labels parameter.
- set_titles(): we'll use this to simplify the name of each panel to use just the value from our variable. This is done with the string "{col_name}". "{col_var}" would instead use the variable name we're setting our facet by, ie. "wormStrain".
sns.set(font_scale = 4)
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_embryo_data,
x = "doseLevel", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 10, aspect = 1,
width = 0.6, # Put some more distance between categories by decreasing width
col = "wormStrain", ## 3.4.5 Set our facets based on wormStrain
col_wrap = 5 ## 3.4.5 Only use 5 facets per row
)
## 3.4.5 Alter the xtick attributes
snsPlot.set_xticklabels(labels = ["Mock", "Medium"])
## 3.4.5 Change our y-axis label
snsPlot.set_axis_labels("dose level", "embryos per animal")
## 3.4.5 set the title
snsPlot.set_titles("{col_name}")
# Show the plot (avoid extra object information)
plt.show()
3.4.6 Overlay on a faceted plot using map_dataframe()¶
Even though boxplots give us summary statistics on our data, it is useful to readers (and reviewers) to be able to see where our individual data points are. We've already used rugplot() to help visualize our data distribution in density plots. In that case, we simply plotted on top of the existing single-panel plot.
Similarly, for a boxplot we can add the data as another layer using an sns.swarmplot() to place dots on top of our boxplot. A swarmplot places overlapping data points next to each other, so we can get a better picture of the distribution of our data.
In the case of our faceted boxplot, however, we cannot simply overlay with sns.swarmplot(). Instead, we need to map our data using the .map_dataframe() method. It will preserve the underlying panel/graph and use its characteristics to overlay a new plot, potentially with a different DataFrame. It uses the following parameters:
- func: the function we want to overlay. This would be sns.swarmplot in our case. Note the lack of parentheses!
- args and kwargs: we'll talk more about these in Lecture 5, but essentially any additional arguments you would normally use for your plotting function are simply supplied here as named parameters.
# Use catplot to make our boxplot
snsPlot = sns.catplot(data = mean_embryo_data,
x = "doseLevel", y = "numEmbryos",
kind = "box", # Make a boxplot
height = 10, aspect = 1,
width = 0.6, # Put some more distance between categories by decreasing width
hue = "doseLevel", ## 3.4.6 Re-colour our boxes by doseLevel again!
col = "wormStrain",
col_wrap = 5,
fliersize = 0 ## 3.4.6 Hide our outliers by making them size 0
)
# Alter the xtick attributes
snsPlot.set_xticklabels(labels = ["Mock", "Medium"])
# set the title
snsPlot.set_titles("{col_name}")
## 3.4.6 Overlay a swarmplot on our catplot
snsPlot.map_dataframe(sns.swarmplot, data = mean_embryo_data,
x = "doseLevel",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=15 # Set the size of our points so they can be seen
)
## 3.4.6 Change our y-axis label. It appears to work best AFTER the map_dataframe() call
snsPlot.set_axis_labels("dose level", "embryos per animal")
# Show the plot (avoid extra object information)
plt.show()
3.5.0 Use the violin plot to visualize distributions¶
If you could combine aspects of the boxplot and the KDE into a single visualization, you would get the violin plot. Another way to think of the violin plot is as a KDE plot that's been shrunk down and placed categorically.
It's actually quite easy to switch over since many of the aspects are similar to the boxplot. We need only change the kind parameter in our catplot() code. The fliersize parameter will also be removed since this no longer applies to our plot and will throw a warning.
# Use catplot to make our violin plot
snsPlot = sns.catplot(data = mean_embryo_data,
x = "doseLevel", y = "numEmbryos",
kind = "violin", ## 3.5.0 Make a violin plot
height = 10, aspect = 1,
width = 0.6, # Put some more distance between categories by decreasing width
hue = "doseLevel", # Re-colour our violins by doseLevel again!
col = "wormStrain",
col_wrap = 5
)
# Alter the xtick attributes
snsPlot.set_xticklabels(labels = ["Mock", "Medium"])
# set the title
snsPlot.set_titles("{col_name}")
# Overlay a swarmplot on our catplot
snsPlot.map_dataframe(sns.swarmplot, data = mean_embryo_data,
x = "doseLevel",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=15 # Set the size of our points so they can be seen
)
# Change our y-axis label. It appears to work best AFTER the map_dataframe() call
snsPlot.set_axis_labels("dose level", "embryos per animal")
# Show the plot (avoid extra object information)
plt.show()
3.5.1 Combine a binary grouping with the split parameter¶
Sometimes a more direct comparison of your data can be applied through the violin plot by generating a split version of it. This is especially helpful when you are working with nested data that is binary and you would like to compare it visually.
We'll initialize this visualization with the split boolean parameter. To help with this visualization we'll also:
- change the overlay from a swarmplot to a stripplot to accommodate the narrower width of the half-violins
- include some inner markers of the quartile information in each violin
- use the density_norm parameter to ensure our violins are the same width (for easier viewing). Other scaling options for this are area and count.
- switch up the colours with the palette parameter, which accepts a dictionary-like object too!
sns.set(font_scale = 1)
# Use catplot to make our violin plot
snsPlot = sns.catplot(data = mean_embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "violin", # Make a violin plot
height = 6, aspect = 2,
hue = "doseLevel",
split = True, ## 3.5.1 This will create hybrid violin plots
inner = "quart", ## 3.5.1 Add quartile markers to each half of the violin
density_norm="width", ## 3.5.1 Make all the widths the same
palette = {"Mock":"green", "Medium":"yellow"} ## 3.5.1 Update the palette of our violins
)
## 3.5.1 Overlay a stripplot on our catplot
snsPlot.map_dataframe(sns.stripplot, data = mean_embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True # Reduce the overlap on any points
)
# Change our y-axis label. It appears to work best AFTER the map_dataframe() call
snsPlot.set_axis_labels("dose level", "embryos per animal")
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Show the plot (avoid extra object information)
plt.show()
4.0.0 Saving your plots to file¶
Up until now, we have taken for granted that our plots have been displayed using a Graphic Device. For our Jupyter Notebooks we can see the graphs right away and update our code. You can even save them manually from the output display but sometimes you may be producing multiple visualizations based on large data sets. In this case it is preferable to save them directly to file.
4.1.0 Save your figures with plt.savefig() method¶
Once you have a figure the way you want it, you can save it in any number of graphical and non-graphical formats. The savefig() method from the pyplot package is here to save the day. When saving the current figure, you can use some of the following parameters:
- fname: The path to the file you want to save, including the extension. If format is not set, then the file extension will be used to infer the format instead.
- dpi: The resolution in dots per inch for your figure.
- format: The file format you'd like to use. Supported filetypes include svg, jpg, eps, and pdf.
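As a quick sketch (using a hypothetical throwaway filename), note that when you skip format, the extension on fname decides the output type:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])

# No explicit format= here: savefig infers PNG from the ".png" extension
outfile = os.path.join(tempfile.gettempdir(), "savefig_sketch.png")
plt.savefig(outfile, dpi = 150)
print(os.path.exists(outfile))  # True
```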
Let's save the split violin plot we built back in section 3.5.1.
# Use catplot to make our violin plot
snsPlot = sns.catplot(data = mean_embryo_data,
x = "wormStrain", y = "numEmbryos",
kind = "violin", # Make a violin plot
height = 6, aspect = 2,
hue = "doseLevel",
split = True, # This will create hybrid violin plots
inner = "quart", # Add quartile markers to each half of the violin
density_norm="width", # Make all the widths the same
palette = {"Mock":"green", "Medium": "yellow"} # Update the palette of our violins
)
# Overlay a stripplot on our catplot
snsPlot.map_dataframe(sns.stripplot, data = mean_embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True
)
# Change our y-axis label. It appears to work best AFTER the map_dataframe() call
snsPlot.set_axis_labels("dose level", "embryos per animal")
# Alter the xtick attributes
snsPlot.set_xticklabels(rotation = 90)
# Save the plot
plt.savefig("data/mean_embryos_byStrain.png",
format = "png",
dpi = 300
)
5.0.0 Plotting multi-panel figures¶
Up until now we've been generating a combination of either faceted plots or simply layering elements upon single plots. Throughout all of these we have not really been mixing the types of plots we've generated. Luckily for us, the matplotlib.pyplot module provides a means for us to put together multiple plot axes in a single figure.
We have already seen some of these figure-level functions in action with relplot() and catplot(), which provided an interface to axes-level methods like scatterplot() or boxplot() when creating faceted plots.
*[Figure] A handy figure from the seaborn overview at: https://seaborn.pydata.org/tutorial/function_overview.html*
What if, however, we would like to create a figure with multiple axes of different types?
5.1.0 Use subplot2grid() to generate a multi-grid plot¶
When generally considering the layout of our data, we want to think of breaking up the figure into a grid. This can start off simply as a 1x1 panel and expand outwards with nested panels within a 2x2 or 3x3 or larger figure. The dimensions also are not limited to square shapes but can be rectangular as well.
The pyplot.subplot2grid() function takes the following parameters:
- shape: The dimensions of the figure given as a (numRow, numCol) tuple. This is essentially the backdrop of all the panels.
- loc: The location of the subplot (Axes object) you are creating in relation to the base figure. Position (0, 0) is the top left corner.
- rowspan: The dimensions of the subplot in number of rows.
- colspan: The dimensions of the subplot in number of columns.
- fig: A figure object to place the Axes object in. Otherwise the current figure is used.
*[Figure] Some simple layouts demonstrating how a figure can be subdivided*
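Before tackling the 3-panel figure, here's a minimal standalone sketch (a simpler, hypothetical 2x2 layout) showing how these parameters fit together:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))
# A tall panel on the left: starts at row 0, column 0 and spans both rows
ax_left = plt.subplot2grid(shape=(2, 2), loc=(0, 0), rowspan=2)
# A small panel occupying only the top-right cell
ax_right = plt.subplot2grid(shape=(2, 2), loc=(0, 1))
print(len(fig.axes))  # 2
```

Each call returns an Axes object positioned on the shared grid of the current figure.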
Let's make a figure with 3 plots as seen in the 4th example above. We'll begin just by generating the specific layout.
# Initialize a figure
fig = plt.figure()
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid(shape = (2, 2), loc = (0, 0), colspan = 1, rowspan = 2)
ax2 = plt.subplot2grid(shape = (2, 2), loc = (...), colspan=1)
ax3 = plt.subplot2grid(shape = (2, 2), loc = (...), colspan=1)
plt.show()
5.1.1 Use the Axes objects as a canvas to plot onto fig¶
Now you can see that our figure encompasses the 3 panels that we envisioned. You'll notice that we named each panel with a reference/variable. We could have also added them to a single list object to save on variable names but to simplify our understanding they've been named separately.
Why do we need these objects? Each Axes-level function we use to plot with can take a parameter ax, where we pass along an Axes object. This identifies which panel we want to plot onto in our overall figure. Otherwise it will use the last Axes object generated (ie ax3 was the last one we generated in our above code). We'll fill our Axes as follows:
- ax1: A scatterplot object from our infection signal data
- ax2: Our split-violin plot of the mean embryo values across strains
- ax3: A kde plot of the uninfected embryo counts per strain
Each of these is a matplotlib.axes.Axes object.
There is, however, a quick hitch. We can no longer rely directly on the figure-level functions of relplot, catplot and displot. In order to plot correctly on these subpanels, we'll need to use the axes-level function versions instead. We'll add our plots one at a time so you can see the effect of each.
# Initialize a figure
fig = plt.figure(figsize=(15, 15))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 3), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 3), (0, 1), colspan=2)
ax3 = plt.subplot2grid((2, 3), (1, 1), colspan=2)
# ---------- Plot 1 ----------#
## 5.1.1 Add the first plot - scatterplot
# We'll build a scatterplot
...(data = infectionSig_data,
x = "area",
y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
ax = ... # Set this plot to ax1 within the fig object
)
# Directly set the labels on the x-axis and y-axis in ax1
...set_xlabel("total area")
...set_ylabel("area infected")
# Show the plot
plt.show()
5.1.2 Add multiple layers to the second Axes¶
Next we'll populate the second panel (top-right) with a combination of violinplot() and swarmplot(). We'll adjust the x-axis tick labels through the .tick_params() method as we will be dealing with the matplotlib.axes.Axes object.
# Initialize a figure
fig = plt.figure(figsize=(15, 15))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 3), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 3), (0, 1), colspan=2)
ax3 = plt.subplot2grid((2, 3), (1, 1), colspan=2)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
# We'll build a scatterplot
sns.scatterplot(data = infectionSig_data,
x = "area",
y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
ax = ax1 # Set this plot to ax1 within the fig object
)
# Directly set the labels on the x-axis and y-axis of ax1
ax1.set_xlabel("total area")
ax1.set_ylabel("area infected")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
## 5.1.2 Use violinplot to make our split violin
...(data = mean_embryo_data,
x = "wormStrain", y = "numEmbryos",
hue = "doseLevel",
split = True, # This will create hybrid violin plots
inner = "quartile", # Add quartile markers to each half of the violin
palette = {"Mock":"green", "Medium": "yellow"}, # Update the palette of our violins
density_norm="width", # Make all the widths the same
alpha = 0.6,
ax = ... # Set this plot to ax2 within the fig object
)
# Overlay a stripplot on our catplot
sns.stripplot(data = mean_embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True,
ax = ...
)
# Directly set the labels on the x-axis and y-axis of ax2
ax2.set_xlabel("worm strain")
ax2.set_ylabel("embryos per animal")
# 5.1.2 Alter the xtick attributes
ax2...(axis = 'x', rotation = 90)
# Show the plot
plt.show()
5.1.3 Move or remove the legend from your plot with various methods¶
So you can see there's a slight issue above with our legend in the violin plot. Due to plotting both a violin and strip plot together, we get a legend with both the colouring and the points. The points from the strip plot don't have much meaning so we can remove those easily in the call to sns.stripplot() by using the legend = False parameter.
If you wanted to remove the legend altogether, you could use the .get_legend().remove() command. This would access the Axes object legend and remove it.
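A minimal sketch of that removal (on a hypothetical one-line plot):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="a line")
ax.legend()               # create a legend on the Axes
ax.get_legend().remove()  # access the Axes legend and remove it
print(ax.get_legend())    # None
```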
You can also move the legend around with the seaborn.move_legend() function which requires 2 parameters:
- obj: the axes object (plot) you want to move around
- loc: the location you wish to place the legend. This can be a combination of upper/center/lower and left/center/right.
# Initialize a figure
fig = plt.figure(figsize=(15, 15))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 3), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 3), (0, 1), colspan=2)
ax3 = plt.subplot2grid((2, 3), (1, 1), colspan=2)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
# We'll build a scatterplot
sns.scatterplot(data = infectionSig_data,
x = "area",
y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
ax = ax1 # Set this plot to ax1 within the fig object
)
# Directly set the labels on the x-axis and y-axis of ax1
ax1.set_xlabel("total area")
ax1.set_ylabel("area infected")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
# Use violinplot to make our split violin
sns.violinplot(data = mean_embryo_data,
x = "wormStrain", y = "numEmbryos",
hue = "doseLevel",
split = True, # This will create hybrid violin plots
inner = "quartile", # Add quartile markers to each half of the violin
palette = {"Mock":"green", "Medium": "yellow"}, # Update the palette of our violins
density_norm="width", # Make all the widths the same
alpha = 0.6,
ax = ax2 # Set this plot to ax2 within the fig object
)
sns.move_legend(ax2, ...) ## 5.1.3 Move the legend to the lower right portion of the plot
# Overlay a stripplot on our catplot
sns.stripplot(data = mean_embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True,
legend = False, ## 5.1.3 Set the legend to false to remove it from the plot
ax = ax2
)
# Directly set the labels on the x-axis and y-axis of ax2
ax2.set_xlabel("worm strain")
ax2.set_ylabel("embryos per animal")
# Alter the xtick attributes
ax2.tick_params(axis = 'x', rotation = 90)
# Show the plot
plt.show()
5.1.4 Add a KDE plot to our final panel¶
Let's complete the set by adding a KDE plot to our final panel. We'll filter our dataset on the fly as we pass it to the kdeplot() function.
# Initialize a figure
fig = plt.figure(figsize=(15, 15))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 3), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 3), (0, 1), colspan=2)
ax3 = plt.subplot2grid((2, 3), (1, 1), colspan=2)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
# We'll build a scatterplot
sns.scatterplot(data = infectionSig_data,
x = "area",
y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
ax = ax1 # Set this plot to ax1 within the fig object
)
# Directly set the labels on the x-axis and y-axis of ax1
ax1.set_xlabel("total area")
ax1.set_ylabel("area infected")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
# Use violinplot to make our split violin
sns.violinplot(data = mean_embryo_data,
x = "wormStrain", y = "numEmbryos",
hue = "doseLevel",
split = True, # This will create hybrid violin plots
inner = "quartile", # Add quartile markers to each half of the violin
palette = {"Mock":"green", "Medium": "yellow"}, # Update the palette of our violins
density_norm="width", # Make all the widths the same
alpha = 0.6,
ax = ax2 # Set this plot to ax2 within the fig object
)
sns.move_legend(ax2, "lower right") ## 5.1.3 Move the legend to the lower right portion of the plot
# Overlay a stripplot on our catplot
sns.stripplot(data = mean_embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True,
legend = False, ## 5.1.3 Set the legend to false to remove it from the plot
ax = ax2
)
# Directly set the labels on the x-axis and y-axis of ax2
ax2.set_xlabel("worm strain")
ax2.set_ylabel("embryos per animal")
# Alter the xtick attributes
ax2.tick_params(axis = 'x', rotation = 90)
# ---------- Plot 3 ----------#
# 5.1.4 rebuild a KDE plot
...(data = N2_mock_data,
x = "numEmbryos",
hue = "date",
fill = True, # The fill parameter is passed on to kdeplot()
ax = ax3 # Set this plot to ax3 within the fig object
)
# Directly set the x-axis label
ax3.set_xlabel("mean embryos per animal")
# Show the plot
plt.show()
5.1.5 Adjust spacing between Axes using tight_layout()¶
Nearly there! We can see there are some overlapping text issues where the y-axis labels of the KDE plot run into the scatter plot beside them. This is due to the size of the y-tick text itself pushing the axis label too far left. We can ask the plot to fix these spacing issues using the plt.tight_layout() method, which resolves the overlaps as best it can.
# Initialize a figure
fig = plt.figure(figsize=(15, 15))
# Generate 3 subplots onto our current figure
ax1 = plt.subplot2grid((2, 3), (0, 0), colspan=1, rowspan=2)
ax2 = plt.subplot2grid((2, 3), (0, 1), colspan=2)
ax3 = plt.subplot2grid((2, 3), (1, 1), colspan=2)
# ---------- Plot 1 ----------#
# Add the first plot - scatterplot
# We'll build a scatterplot
sns.scatterplot(data = infectionSig_data,
x = "area",
y = "area.infected",
hue = "strain", # Set the point-colour by strain
alpha = 0.6, # Set the transparency of the points
ax = ax1 # Set this plot to ax1 within the fig object
)
# Directly set the labels on the x-axis and y-axis of ax1
ax1.set_xlabel("total area")
ax1.set_ylabel("area infected")
# ---------- Plot 2 ----------#
# Add the second plot - violin plot
# Use violinplot to make our split violin
sns.violinplot(data = mean_embryo_data,
x = "wormStrain", y = "numEmbryos",
hue = "doseLevel",
split = True, # This will create hybrid violin plots
inner = "quartile", # Add quartile markers to each half of the violin
palette = {"Mock":"green", "Medium": "yellow"}, # Update the palette of our violins
density_norm="width", # Make all the widths the same
alpha = 0.6,
ax = ax2 # Set this plot to ax2 within the fig object
)
sns.move_legend(ax2, "lower right") ## 5.1.3 Move the legend to the lower right portion of the plot
# Overlay a stripplot on our catplot
sns.stripplot(data = mean_embryo_data,
x = "wormStrain",
y = "numEmbryos",
hue = "doseLevel", # split the points by dose level just like nested boxplots
palette="dark:black", # Recolour all of the points to black
size=5, # Set the size of our points so they can be seen
dodge = True,
legend = False, ## 5.1.3 Set the legend to false to remove it from the plot
ax = ax2
)
# Directly set the labels on the x-axis and y-axis of ax2
ax2.set_xlabel("worm strain")
ax2.set_ylabel("embryos per animal")
# Alter the xtick attributes
ax2.tick_params(axis = 'x', rotation = 90)
# ---------- Plot 3 ----------#
# rebuild a KDE plot
sns.kdeplot(data = N2_mock_data,
x = "numEmbryos",
hue = "date",
fill = True, # The fill parameter is passed on to kdeplot()
ax = ax3 # Set this plot to ax3 within the fig object
)
# Directly set the x-axis label
ax3.set_xlabel("mean embryos per animal")
## 5.1.5 Re-adjust the axes to remove overlap due to axis text
...
# Show the plot
plt.show()
Not too shabby! And we're done!
6.0.0 Class summary¶
That's our fourth class on Python! You've made it through and we've learned about taking advantage of in-built DataFrame methods for exploratory data analysis as well as how to finally visualize some of your data:
- Exploratory data analysis with groupby() and aggregation functions.
- Basic plots with the matplotlib.pyplot module.
- Advanced visualizations with the seaborn package.
- Saving your figures to file.
- Plotting multi-panel seaborn figures through the pyplot package.
6.1.0 Submit your completed skeleton notebook (2% of final grade)¶
At the end of this lecture a Quercus assignment portal will be available to submit your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade but a bonus 0.5% will also be awarded for submissions made within 24 hours from the end of lecture (ie 1600 hours the following day).
- From the Jupyter Notebook, select File > Save Notebook or use ctrl + S
- Return to the Jupyter folder page (ie click on the top-left "jupyterhub" icon)
- Locate your .ipynb file and click on the check box to its left side.
- If it is still active it will have a green icon beside its name. Above the file list will be an option to Shut Down. After clicking on this, you may re-select the file and above the file list choose Download to save to your local hard drive.
- Upload your .ipynb file to the appropriate Quercus assignment portal.
*[Figure] A sample screen shot for saving your completed skeleton notebooks*
6.2.0 Post-lecture DataCamp assessment (5% of final grade)¶
Soon after the end of each lecture, a homework assignment will be available for you in DataCamp. Your assignment is to complete the Introduction to Data Visualization with Seaborn course (4 chapters, 4200 possible points). This is a pass-fail assignment, and in order to pass you need to achieve at least 3150 points (75%) of the total possible points. Note that when you take hints from the DataCamp chapter, it will reduce your total earned points for that chapter.
In order to properly assess your progress on DataCamp, at the end of each chapter, please print a PDF of the summary. You can do so by following these steps:
- Navigate to the Learn section along the top menu bar of DataCamp. This will bring up a sidebar menu where you can access the various courses you have been assigned under Assignments, or you can scroll down on the page to find the My Assignments section. Click on the relevant chapter or course and this may bring you back to within the course itself.
- You should now be on a Course Summary page (see figure below). You can expand each chapter of the course by clicking on the VIEW CHAPTER DETAILS link. DO THIS FOR ALL COMPLETED SECTIONS ON THE PAGE!
- Select the visible text in the course window (not from any top or side menus) using a click and drag technique. Using something like Ctrl+A will not properly select the course text.
- Print the page from your browser menu and make sure you have "Selection Only" from the Settings options on the Print dialogue. Save as a single PDF. Note: if you don't select the text area correctly (at least in Google Chrome) you may not be able to print the full page. Your results should look something like this:
*A sample screen shot for one of the DataCamp assignments. You'll want to try and print off a single PDF of this section from Learn > My Assignments.*
You may need to take several screenshots if you cannot print it all in a single try. Submit the file(s) or a combined PDF for the homework to the assignment portal on Quercus. By submitting your scores for each section and chapter, you allow us to keep track of your progress, identify knowledge gaps, and give you a standardized way to check on your assignment "grades" throughout the course.
You will have until 12:59 hours on Thursday, February 6th to submit your assignment (right before the next lecture).
6.3.0 Acknowledgements¶
Revision 1.0.0: materials prepared for CSB1021H S LEC0140, 01-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.2.0: edited and prepared for CSB1021H S LEC0140, 01-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.2.2: edited and prepared for CSB1021H S LEC0140, 01-2024 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.2.3: edited and prepared for CSB1021H S LEC0140, 01-2025 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
7.0.0 Appendix 1 - In case your kernel should crash¶
If your kernel crashes, you can use this code cell to recreate all of the data used in section 5.0.0. Convert the cell below into a code cell by pressing the "Y" key while it is highlighted in "Command" mode, then simply run the cell and it should recreate the datasets needed.
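The actual recovery cell follows in the notebook itself. As a hypothetical sketch of what such a cell typically does (the variable names and values here are purely illustrative, not the real section 5.0.0 data), it simply rebuilds the objects from scratch so your later cells can run again:

```python
import pandas as pd

# Hypothetical example only -- the real recovery cell recreates the
# section 5.0.0 datasets; the names and values below are illustrative.
surveys = pd.DataFrame({
    "species": ["DM", "DO", "DM", "PP"],
    "weight": [42, 51, 44, 17],
})
print(surveys.shape)  # (4, 2)
```

Because a crashed kernel loses all variables in memory, re-running a single cell that redefines everything is faster than re-executing the whole notebook.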
The Centre for the Analysis of Genome Evolution and Function (CAGEF)¶
The Centre for the Analysis of Genome Evolution and Function (CAGEF) at the University of Toronto offers comprehensive experimental design, research, and analysis services in microbiome and metagenomic studies, genomics, proteomics, and bioinformatics.
From targeted DNA amplicon sequencing to transcriptomes, whole genomes, and metagenomes, from protein identification to post-translational modification, CAGEF has the tools and knowledge to support your research. Our state-of-the-art facility and experienced research staff provide a broad range of services, including both standard analyses and techniques developed by our team. In particular, we have special expertise in microbial, plant, and environmental systems.
For more information about us and the services we offer, please visit https://www.cagef.utoronto.ca/.